These San Diego Scientists Can Predict How You Look Using Only Your Anonymous DNA
Monday, September 4, 2017
Can your personal genetic data be kept truly private? J. Craig Venter and his team of scientists in La Jolla are out with a new study that offers a word of warning.
If you have ever provided your DNA for medical research or as part of a consumer test such as 23andMe, you have probably been assured that your personal genetic information will be kept private.
But human genome sequencing pioneer J. Craig Venter and his team of scientists in La Jolla are out with a new study that offers a word of warning: computer algorithms can now predict what you look like using only your anonymous genome sequence.
"This is the first study combining multiple trait predictions to do identification," Venter said. "I think people need to really be cautious. If someone is promising their data can be de-identified, and there's no security on the database, they're getting false promises."
For a paper published Monday in the Proceedings of the National Academy of Sciences, Venter and his colleagues at Human Longevity Inc. sequenced the genomes of more than 1,000 research volunteers in San Diego.
The subjects came from ethnically diverse backgrounds. They allowed researchers to capture 3D images of their faces, record their voices and take measurements of other physical traits like their height, weight and eye color.
Through machine learning, Venter and his colleagues trained computer algorithms to generate predictions of specific physical traits based only on raw genomic data. Eventually, the algorithms were able to produce digital facial images that, in many cases, turned out to be strikingly similar to the images drawn from a participant's actual face.
"When you have the whole genome, we can predict a photograph, your height, your weight, your eye color," said Venter. He noted that the facial predictions can capture fine details like the shape of chins and noses, as well as overall face shape.
Because the images predict a person's appearance in early adulthood, they may look a bit off depending on the person's actual age. Venter, 70, said when one of the early computer-generated predictions was shown to his wife, Heather Kowalski, she was able to quickly recognize him in the image.
But, "she said it either looked like a younger version of me or a picture of my son," Venter said.
The predictions were not perfect. For instance, they could not reliably predict whether or not a person was bald.
In diverse groups of 10 people, the algorithms were able to correctly identify at least eight people. But their accuracy went down when trying to identify individuals within groups of people who all shared either European or African American ancestry.
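To see why matching gets harder when the people in a group resemble one another, consider a toy version of the pick-one-out-of-a-group task. The article does not describe the study's actual matching method, so plain nearest-neighbor distance is used here as a stand-in. The trait vectors, their values and the function names are all hypothetical, chosen only for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length trait vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_match(predicted, pool):
    """Index of the pool member whose measured traits are closest to the prediction."""
    return min(range(len(pool)), key=lambda i: distance(predicted, pool[i]))

def identification_rate(predictions, pool):
    """Fraction of people whose genome-based prediction points back to themselves."""
    hits = sum(1 for i, p in enumerate(predictions) if pick_match(p, pool) == i)
    return hits / len(pool)

# Hypothetical normalized traits per person (assumed for illustration only).
# Persons 0 and 3 are deliberately similar, like members of a homogeneous group.
measured = [
    [0.20, 0.90, 0.10],
    [0.80, 0.10, 0.50],
    [0.50, 0.50, 0.90],
    [0.25, 0.85, 0.20],
]

# Hypothetical predictions "from the genome": close to the truth, but noisy.
predicted = [
    [0.22, 0.88, 0.18],
    [0.75, 0.15, 0.45],
    [0.55, 0.45, 0.80],
    [0.24, 0.86, 0.19],
]

rate = identification_rate(predicted, measured)
print(rate)  # 0.75: person 0 is mistaken for the look-alike person 3
```

In this sketch the only miss is the person who has a near look-alike in the pool, which mirrors the reported drop in accuracy for groups sharing the same ancestry.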
Still, the predictive power in this early research led scientists not involved in the study to raise questions about privacy in an age when more people are having their genetic data analyzed and stored in purportedly anonymous databases.
"The work shows a realistic risk that seemingly 'anonymous' genomes might lead to re-identification," said UC San Diego computer scientist Xiaoqian Jiang.
XiaoFeng Wang, an Indiana University Bloomington informatics professor, agreed that the privacy implications of this work are significant.
"No longer can one assume that exposure of her genome sequence will not be directly linked back to her," Wang said. "She needs to realize that sharing her genomic data is just like sharing her picture, with increasingly high resolution."
When asked if people who have purchased direct-to-consumer tests from companies like 23andMe should worry about the privacy of their genetic information, Venter said yes.
"We can generate photos from the 23andMe data," he said.
A 23andMe spokeswoman said the company uses a "custom-designed privacy and security program to protect the unique set of information associated with our service."
"We closely follow both theoretical research and operational updates regarding new and innovative uses of genetic data, including studies like this one," she wrote in an email to KPBS.
Venter said he has never understood the concept of de-identifying a person's genomic data by simply removing references to information like their name or Social Security number. A genome contains everything needed to identify a person, he said.
In the study, Venter and his colleagues advocate for using cutting-edge computer science, including strong encryption, to keep this kind of data secure. But they said these solutions are being adopted slowly across the booming field of genomics.
Venter also hopes to make these predictive images useful in solving crimes.
For instance, he said DNA samples from rape kits could perhaps one day help better identify rapists. Or DNA in unidentified human remains could be used to create images that investigators could then cross-reference with photos of missing persons.
"We want to improve them so they get to the level where a court would accept them as evidence," he said.
It is now too late for Venter to safeguard his own genome. Back when he became the first human to have their genome sequenced, he posted the data online in the interest of advancing research.
"If I was advising me 15 years ago, I'd say you might want to be very cautious with that," Venter said.