Q&A: A New Tool for Ensuring Genetic Privacy

Gamze Gursoy and Mark Gerstein of the Yale School of Medicine have developed a strategy for stripping identifying variants from functional genomic data, balancing privacy with utility.

Amanda Heidt
Nov 12, 2020


The appetite for data on individuals’ genetic sequences is growing, both from consumers—the number of testing kits sold by leading companies such as 23andMe and Ancestry.com topped 26 million by the start of 2019—and from scientists looking to amass large datasets for medical research. In addition to whole genome sequencing and genotyping, in which scientists identify a person’s genetic variants, new functional genomics tools such as RNA-seq, ChIP-seq, and single-cell sequencing have led to an exploding number of tests detailing how people (and their individual cells) respond to environmental conditions, medications, or disease. 

But as more and more people volunteer their information, the seemingly anonymized data that results from such tests is becoming a target for hackers looking to glean sensitive medical information. In particular, the many genetic variants that make each person unique can be used to identify them as surely as a fingerprint, revealing confidential information such as their disease status. In a process known as a linkage attack, a hacker can use known information about a person from sources such as public records or even discarded objects that contain a person’s DNA to identify them within an anonymous database compiled by academic researchers. For example, if someone were to participate anonymously in an AIDS study, it might be possible to uncover their participation—and thus their HIV status—using DNA sequenced from a cigarette.

Bioinformaticians and data scientists are therefore working to develop new ways of storing and analyzing data that protect anonymity while still allowing for the type of collaborative sharing necessary to advance medical science. 

The Scientist spoke to Gamze Gursoy and Mark Gerstein, two bioinformaticians at the Yale School of Medicine, about how sensitive information can be gleaned from genomic data and ways of balancing privacy with utility. Their paper, published today (November 12) in Cell, details a new method for sanitizing, or removing sensitive information from, functional genomics datasets by separating out identifying variants in a way that does not affect data quality.

See “Technique to Track Golden State Killer Suspect Could Find You Too”

The Scientist: We hear a lot about genetic privacy in the context of things like 23andMe and Ancestry.com results, but what you’re describing in this paper is a different type of data. What sets functional genomics apart from something like DNA sequencing or genotyping?

Gamze Gursoy: The types of data that 23andMe and Ancestry.com [provide] when people are interested in looking at ancestry or at their predisposition to diseases are the genetic variants you get from DNA sequencing. . . . When it comes to functional genomics, you are doing these experiments to understand the activities in the cell nucleus—if genes are expressed, if transcription factors are binding. These experiments are not necessarily done to identify the genetic variants of the individual. 

Mark Gerstein: You can only sequence your genome once, but you can do essentially an infinite number of functional genomics experiments on one person. Inexorably, those human samples will give you variants from the people that donated those samples, but it’s utterly irrelevant a lot of times to what you’re interested in. 

Bioinformaticians are working to remove sensitive, identifying genetic information from functional genomics datasets to allow for public sharing of data while preserving privacy. (Image: Gamze Gursoy and Charlotte Brannon)

TS: What can our functional genomic data reveal about us as individuals, and why, therefore, should we be concerned about the privacy of this information?

GG: There are two types of information you can get. Because these are sequences from a sequencer, you can get [some of] the genetic variants of the individual. That’s what we are trying to sanitize, because we don’t need them to calculate gene expression, for example. 

But there is another thing. If you figure out who the person is that you have functional genomics data from, you can learn phenotypic information about them. Because usually, these functional genomics experiments are done for the purpose of understanding if a gene is on or off in a disease. You’re trying to protect the genetic variants so that you cannot reidentify the person, because once you do, you can get some private, sensitive phenotypic information.

MG: The analogy to think about is what has happened with the internet. Initially, people thought it was very innocuous to post pictures on Facebook. Now, there are so many people in the world looking at these things. I really think it’s a very analogous process, because the intent of the biomedical enterprise is to sequence a very large fraction of people’s genomes and to build massive databases. What would be very unfortunate is for people not to think about this stuff ahead of time, for us to build this huge database in the future and find out that it has all these annoying leaks. That would be extremely damaging to biomedical science.

Even though it might seem academic and kind of silly to be thinking about all this stuff now, it is really important to do it before it gets to scale. 

See “Hackers Are Breaking into Medical Databases to Protect Patient Data”

TS: What is a linkage attack, and can you share a real-life example?

GG: Let’s say you have two data sets. One of them has information coming from a known individual and the other one is an anonymized data set. In a linkage attack, you use the known information to deanonymize the anonymized data set. 

What we have done, for example, is take coffee cups from an individual and sequence the DNA we found on the coffee cup. . . . We know the owner of the coffee cup, and we have a functional genomics database. We deanonymized the database by linking the genotype that we obtained from the coffee cup to reveal phenotypic information about the owner of the coffee cup.
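To make the idea concrete, here is a minimal Python sketch of the matching step in a linkage attack. It is an illustration only, not the procedure the Yale team used. The variant IDs, sample names, phenotype labels, and the 0.9 agreement threshold are all hypothetical, and a real attack would involve far more variants and noisier genotype calls.

```python
from typing import Dict, Optional

# A genotype maps a variant ID to the number of alternate alleles (0, 1, or 2).
# All IDs and values below are made up for illustration.
Genotype = Dict[str, int]


def match_score(known: Genotype, candidate: Genotype) -> float:
    """Fraction of shared variant sites at which the two genotypes agree."""
    shared = set(known) & set(candidate)
    if not shared:
        return 0.0
    agree = sum(1 for variant in shared if known[variant] == candidate[variant])
    return agree / len(shared)


def link(known: Genotype,
         anonymized: Dict[str, Genotype],
         threshold: float = 0.9) -> Optional[str]:
    """Return the ID of the best-matching anonymized sample, or None
    if no sample agrees with the known genotype at `threshold` or above."""
    best_id, best_score = None, 0.0
    for sample_id, genotype in anonymized.items():
        score = match_score(known, genotype)
        if score > best_score:
            best_id, best_score = sample_id, score
    return best_id if best_score >= threshold else None


# Known information: genotypes recovered from an identified person's coffee cup.
coffee_cup = {"rs1": 1, "rs2": 0, "rs3": 2, "rs4": 1}

# Anonymized information: variants that leak out of a functional genomics database.
database = {
    "sample_A": {"rs1": 0, "rs2": 0, "rs3": 1, "rs4": 2},
    "sample_B": {"rs1": 1, "rs2": 0, "rs3": 2, "rs4": 1},
    "sample_C": {"rs1": 2, "rs2": 1, "rs3": 2},
}
phenotypes = {"sample_A": "control", "sample_B": "case", "sample_C": "control"}

hit = link(coffee_cup, database)
if hit is not None:
    # The link exposes the phenotype attached to the "anonymous" sample.
    print(hit, phenotypes[hit])
```

In this toy data, only one sample agrees with the coffee-cup genotypes at every shared site, so the phenotype recorded for that sample can be tied to the cup's owner. Removing the variants from the shared data leaves the attacker nothing to match against.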

TS: Do you think that privacy laws are keeping pace with the speed at which we’re developing these genomic tools?

MG: Not exactly. On one hand, there’s people who don’t really understand it at all, who essentially share genetic data without understanding the risk. But I think that’s more of a minority. The mainstream thought process now is that genomic privacy is a big deal. 

But what happens is that everything gets locked down. It’s very hard to aggregate lots of studies . . . to get statistical power to find out important genetic correlations. This thought process, even though we certainly understand it, isn’t really a great thought process for functional genomics data where the point of the data is not the DNA variants. 

The point of our paper is that maybe there’s a different way of thinking about this. You could take the results of the experiments and do this type of sanitization, and then be able to share them in a much freer way. You can sequence someone’s DNA once and have the genotypes and lock them away safely. But then the individual has a lot of tissues, a lot of cells, for functional genomics work.

See “Startups Plan the Health Data Gold Rush”

TS: How are you able to sanitize the functional genomics data to shield against these linkage attacks?

GG: These functional genomics data are shared in certain file formats. [During analysis, researchers] take the sequencing reads and they map them to a human reference genome. [The file] tells us where these reads are mapping in the genome, but it also tells us what the sequence of the reads are. 

We look at this data file format, and [when we find a genetic variant in the read], we basically change it to a format where [there’s no] difference between the reference genome and the read. So, if you have a sequence that is different from that reference genome, we [overwrite it with] the reference genome while still preserving where [the sequence] maps in the genome. 

When you have single nucleotide variants where you have a letter change, it’s very easy. [For example], if you’re just changing a letter A to C, the read length doesn’t change, and it’s not going to affect anything. But it gets really complicated when you have deletions and insertions. You could have a region where your genome doesn’t have that region, but the reference genome does. I [would] need to add, for example, a few letters to the end of the read so that it still has the same length.
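The overwrite-with-reference idea Gursoy describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method published in the paper. It assumes each read comes with a 0-based mapping position and a CIGAR-style alignment string (as used in alignment formats such as SAM) limited to the operations M (aligned), I (insertion), and D (deletion). The function names and the toy reference sequence are hypothetical.

```python
import re
from typing import List, Tuple


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR-style string such as '3M2I3M' into (length, op) pairs.
    Only M, I, and D are handled in this sketch."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MID])", cigar)]


def sanitize_read(reference: str, start: int, read: str, cigar: str) -> Tuple[str, str]:
    """Overwrite a read with the reference sequence at its mapped position.

    The sanitized read keeps the original start position and length, but it
    no longer differs from the reference, so it carries no variants.
    Returns (sanitized_sequence, sanitized_cigar).
    """
    ops = parse_cigar(cigar)
    # M and I consume read bases; D consumes reference bases only.
    read_len = sum(n for n, op in ops if op in "MI")
    if read_len != len(read):
        raise ValueError("CIGAR string does not match read length")
    sanitized = reference[start:start + read_len]
    if len(sanitized) < read_len:
        raise ValueError("read runs past the end of the reference")
    # Every base now matches the reference, so the alignment is all 'M'.
    return sanitized, f"{read_len}M"


reference = "ACGTACGTACGTACGT"  # toy reference, not real genomic sequence

# Single nucleotide variant: one letter differs, length is unchanged.
snv = sanitize_read(reference, 2, "GTTCGT", "6M")

# Insertion: the read holds two bases absent from the reference; after they
# are dropped, two further reference letters fill out the end of the read.
ins = sanitize_read(reference, 2, "GTAGGCGT", "3M2I3M")

# Deletion: the read lacks one reference base; the reference base is restored
# and the read is kept at its original length.
dele = sanitize_read(reference, 2, "GTAGT", "3M1D2M")

print(snv, ins, dele)
```

In each case the output has the same start position and length as the input read, which is the property Gursoy highlights: the record still says where the read maps, but not how the donor's sequence differs from the reference. A real tool would also have to deal with base qualities, clipped bases, paired reads, and other details that this sketch ignores.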

TS: I was curious whether either of you has done a consumer genetics test, and if you were concerned about privacy.

GG: I talked to my family and then I did it. I was a little concerned, but my curiosity was a little bigger than my concern. Even though people don’t trust companies, in terms of the data abuse I think that there are really good data storage systems that these companies are putting in place. Of course, it’s going to be used for commercial purposes. But in terms of tracking me down and violating my privacy, I don’t see as much of a problem.

MG: I haven’t. I’m very much the worrier character.

See “‘Anonymous’ Genomes Identified”

Editor’s note: The interview was edited for brevity.

G. Gursoy et al., “Data sanitization to reduce private information leakage from functional genomics,” Cell, doi: 10.1016/j.cell.2020.09.036, 2020.