Data privacy researchers have been able to identify the names of hundreds of participants in the Personal Genome Project (PGP) using demographic data from their profiles, according to a paper out this week on the arXiv preprint server. The authors also suggest ways in which contributors can increase their privacy.
Launched in 2006, the PGP aims to collect genetic data as well as health and lifestyle information from 100,000 people to help researchers tease apart the interactions between genotype, environment, and phenotype. The project does not guarantee privacy, reported MIT Technology Review, and participants can choose to disclose as much personal data as they want, including ZIP code, birth date, and gender, on their online PGP profile. But these profiles are “de-identified,” meaning their names and addresses are not made public.
Now, researchers from Harvard University have demonstrated that this veneer of anonymity is easily breached. By comparing demographic data from 579 PGP profiles containing ZIP codes, full dates of birth, and genders with information from voter lists and other public records, and by identifying patient names in the files participants had uploaded to the PGP website, the researchers identified 241 participants. Checking the results with administrators at the PGP, the team found that 84 percent of these matches were correct, demonstrating that PGP profiles are vulnerable to re-identification.
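The demographic part of this matching can be illustrated with a minimal sketch. It is not the researchers' actual code: the field names, the sample records, and the rule that only a single unambiguous match counts as a re-identification are all assumptions made for illustration.

```python
from collections import defaultdict


def quasi_identifier(record):
    """Combine the three demographic fields into one lookup key."""
    return (record["zip"], record["birth_date"], record["gender"])


def link_records(profiles, public_records):
    """Match de-identified profiles to named public records.

    A profile counts as re-identified only when exactly one public
    record shares its ZIP code, birth date, and gender (an assumption
    of this sketch, not a rule stated in the article).
    """
    names_by_key = defaultdict(list)
    for record in public_records:
        names_by_key[quasi_identifier(record)].append(record["name"])

    matches = {}
    for profile in profiles:
        candidates = names_by_key.get(quasi_identifier(profile), [])
        if len(candidates) == 1:
            matches[profile["profile_id"]] = candidates[0]
    return matches


# Hypothetical sample data, invented for this example.
PROFILES = [
    {"profile_id": "p1", "zip": "02138", "birth_date": "1970-01-15", "gender": "F"},
    {"profile_id": "p2", "zip": "02139", "birth_date": "1982-06-30", "gender": "M"},
    {"profile_id": "p3", "zip": "02140", "birth_date": "1990-03-03", "gender": "F"},
]
VOTERS = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1970-01-15", "gender": "F"},
    {"name": "Bob Sample", "zip": "02139", "birth_date": "1982-06-30", "gender": "M"},
    {"name": "Bill Sample", "zip": "02139", "birth_date": "1982-06-30", "gender": "M"},
]

if __name__ == "__main__":
    print(link_records(PROFILES, VOTERS))
```

In this invented data set only the first profile is linked to a name: the second matches two voters and so stays ambiguous, and the third matches no one.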
This could be harmful, the authors of the study argued, because many participants reveal sensitive personal details, such as predispositions to genetic diseases, that might affect life insurance premiums and claims. The 2008 Genetic Information Nondiscrimination Act covers medical insurance, but not life insurance.
The researchers added that privacy protection could easily be firmed up with little impact on research value if PGP participants included less precise birth date and ZIP code information. They have also developed an editing tool to help people make such changes to their PGP profiles, which cannot otherwise be modified.
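A small sketch shows why coarser values help. The article does not say how much precision the researchers recommend dropping, so reducing the birth date to a year and the ZIP code to its first three digits are assumptions here, as are the field names and sample profiles.

```python
from collections import Counter


def generalize_profile(profile):
    """Return a copy with a coarser birth date and ZIP code.

    Assumes ISO-formatted dates (YYYY-MM-DD) and five-digit ZIP codes;
    keeping only the year and the first three ZIP digits is an
    illustrative choice, not the researchers' stated rule.
    """
    coarse = dict(profile)
    coarse["birth_date"] = profile["birth_date"][:4]
    coarse["zip"] = profile["zip"][:3]
    return coarse


def count_unique(profiles):
    """Count profiles whose (ZIP, birth date, gender) combination is unique."""
    def key(p):
        return (p["zip"], p["birth_date"], p["gender"])

    counts = Counter(key(p) for p in profiles)
    return sum(1 for p in profiles if counts[key(p)] == 1)


# Hypothetical sample data, invented for this example.
SAMPLE_PROFILES = [
    {"zip": "02138", "birth_date": "1970-01-15", "gender": "F"},
    {"zip": "02139", "birth_date": "1970-08-02", "gender": "F"},
    {"zip": "02446", "birth_date": "1982-06-30", "gender": "M"},
]

if __name__ == "__main__":
    coarse_profiles = [generalize_profile(p) for p in SAMPLE_PROFILES]
    print(count_unique(SAMPLE_PROFILES), count_unique(coarse_profiles))
```

With full values, every sample profile has a unique combination of fields. After generalization the first two share the same combination, so neither can be singled out by demographics alone.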
Clarification (May 3): The text has been amended to more accurately reflect that a portion of the 241 participants “re-identified” were found using names included in the files they had uploaded to the PGP website. As Jane Yakowitz Bambauer, associate professor of law at the University of Arizona, pointed out on the Info/Law blog, 115 of the 241 were “re-identified” in this way, and 80 of those 115 could not have been found using their demographic data alone. Thus, using demographic data alone, the researchers could have re-identified only 161 of the 579 participants.