The news continues to bring unprecedented revelations describing the US government’s machinations to mine personal information and snoop on enemies and allies alike. In June it was the ongoing National Security Agency leak saga, spurred by the release of protected federal surveillance information by former defense contractor Edward Snowden.
Other news stories further suggest that the government is trawling more than our personal communications. Cables uncovered by WikiLeaks indicate that Big Brother’s interests include exploring the DNA of foreign diplomats and officials.
But it’s not just the government compiling databases of genetic information. With the precipitous drop in DNA sequencing costs, entire human genomes can now be deciphered for around a millionth of what it cost 10 years ago. Altogether, the personal genomics industry, grassroots patient projects, and academic research efforts will end up putting hundreds of thousands of genetic sequences online, and soon.
It’s known that each...
For example, DNA extracted from bits of sloughed-off hair or skin could be used to follow a person’s movements, to reveal evidence of stigmatized medical conditions or illegitimate children, or even to plant an incriminating and/or synthesized DNA sample at a crime scene. While these scenarios likely represent the far limits of current technology, as more governments and corporations gain the technical know-how to perform large-scale personal-information mining, we should carefully consider the consequences of making large amounts of data, particularly genomic data, universally available.
Of course, big data isn’t just for spying. It’s also crucial for the future of medicine, and especially for translating genomics research. Thus, potential abuses notwithstanding, we should promote the accumulation of vast collections of DNA as powerful tools to combat disease. To this end, the public needs to be assured not only that threats of government exploitation are kept in check, but also that the more pedestrian risks of leaks to insurance providers, employers, or even friends are prevented.
The yin and yang of genomic data access are exemplified by the National Institutes of Health’s August announcement regarding access to the sequenced HeLa genome. Henrietta Lacks, the progenitor of the famed quasi-eponymous HeLa cell line, was a poor African American woman whose cervical cancer tumor cells live on 60 years after she died from the disease. Lacks did not provide consent for any of the hundreds of thousands of experiments that have been conducted using cell lines derived from the original HeLa line. When the HeLa genome sequence was published without consent from the Lacks family, the story made front-page news.
Ethical issues aside, the HeLa genome is an incredibly useful tool for the biomedical community: having access to the sequence helps researchers better interpret experiments carried out using HeLa cells. Still, the availability of this genome partially exposes close relatives within the Lacks family to an invasion of their privacy.
NIH’s solution balanced the desire to make the information available for biomedical research against the desire to protect it to a reasonable degree. The data, kept in a protected environment, will be made available to researchers whose applications to use it have been approved by a data-access committee. In some respects, this was a landmark decision. However, it is not clear that this type of solution scales to the thousands, even millions, of genomes that must be analyzed to produce statistically sound biomedical research.
What can we do?
Technological solutions such as anonymization and encryption are unlikely to work on their own. To date, biomedical researchers have been greatly stymied by the time-consuming and technically difficult tasks of de-identifying and encrypting terabytes of genomic data. Moreover, in the race between overbearing, research-stifling encryption tactics and hackers, technical solutions inevitably become de facto challenges that the latter predictably overcome.
We envision a hybrid social-technological solution wherein codes of conduct, regulatory oversight, and punitive threats that keep data-mining corporate organizations in line could be combined with technical approaches for use in genomics research.
For example, a nongovernmental agency overseeing a limited-access, cloud-based database could be incentivized to protect our genomic data. Such an agency could store most of the genomic information for biomedical research in an extraterritorial cloud repository assembled with consent from the global scientific community and ad hoc standards committees. Researchers seeking access to the data would contribute funds to support this entity, and be bound by the repository’s rules and standards. These regulations would contractually supersede many of the weaker genomic data privacy protections in place across the vastly different local jurisdictions. Individual researchers would be granted a personal license to access this information, which would depend on continuing education.
A cloud-based system would enable all of the information in the repository to be maintained in a standardized format so that researchers could develop their own analytical programs and move them up to the cloud to scale to large volumes of data. Integral to the cloud proposal is the idea that a fraction of the data would be made freely available by genomic “test pilots,” who would bear the risk of making their personal information public. This public information would be the basis for “stub” data sets, which researchers could use to benchmark and develop their programs. (See “Data Drive”)
Further, the cloud could be set up in a way that would restrict the outbound flow of data and log all use of secure data sets. And if a researcher did violate the privacy of the consented genomes, he or she would be punished. Just as the threat of losing their license often prevents less-scrupulous attorneys from violating client confidentiality, a licensing system for genomics researchers could provide meaningful penalties to prevent intentional leaks.
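To make the idea of restricting outbound data and logging all use more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any existing system: the `Repository` class, its method names, and the toy genotype records are all hypothetical. It shows three of the mechanisms described above: a license check, a log entry for every request, and a cap on how much raw data may leave.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Repository:
    """Toy stand-in for a limited-access genomic repository (hypothetical)."""
    records: dict            # sample ID -> genotype string at a single site
    licensed: set            # IDs of researchers holding a valid license
    max_rows_out: int = 0    # raw rows a request may export; 0 = aggregates only
    audit_log: list = field(default_factory=list)

    def _log(self, researcher, action, allowed):
        # Every request is recorded, whether or not it is granted.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "researcher": researcher,
            "action": action,
            "allowed": allowed,
        })

    def allele_count(self, researcher, allele):
        """Return an aggregate count; raw genotypes stay in the repository."""
        allowed = researcher in self.licensed
        self._log(researcher, f"allele_count:{allele}", allowed)
        if not allowed:
            raise PermissionError("no valid license")
        return sum(g.count(allele) for g in self.records.values())

    def export_rows(self, researcher, sample_ids):
        """Raw export, denied when it exceeds the outbound limit."""
        allowed = (researcher in self.licensed
                   and len(sample_ids) <= self.max_rows_out)
        self._log(researcher, f"export_rows:{len(sample_ids)}", allowed)
        if not allowed:
            raise PermissionError("export denied")
        return {s: self.records[s] for s in sample_ids}

    def revoke(self, researcher):
        """Penalty for a violation: the researcher loses access."""
        self.licensed.discard(researcher)
        self._log(researcher, "license_revoked", True)


# Usage with made-up data: aggregates are allowed, raw export is not.
repo = Repository(records={"s1": "AG", "s2": "GG", "s3": "AA"},
                  licensed={"alice"})
g_count = repo.allele_count("alice", "G")
```

In this sketch the default outbound limit is zero, so a licensed researcher can compute summary statistics but cannot copy individual genotypes out, and the audit log keeps a record of denied requests as well as granted ones. A real repository would need far more than this, including authentication and tamper-resistant log storage.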
We need to move quickly to implement practical solutions for genomic privacy. The ethical issues are clear, the medical benefits of DNA data mining are real, and most importantly, more genomic data are being produced every day.