Anonymity Under Threat

Scientists uncover the identities of anonymous DNA donors using freely available web searches.

January 17, 2013

A person donating their DNA sequence anonymously for research purposes may in fact be identified by a few simple web searches, according to a paper published today (January 17) in Science. But rather than trying to protect anonymity, some scientists believe efforts should instead be focused on educating DNA donors and on legislating against the misuse of sequence data.

“The paper is a nice example of how simple it is to re-identify de-identified samples and that the reliance on de-identification as the mechanism of ensuring privacy and avoiding misuse is one that is not viable,” said Nita Farahany, a professor of law and research at Duke University in Durham, North Carolina, who was not involved in the study.

Participants in public sequencing projects are told that their anonymity is not 100 percent guaranteed, but the risk of a person’s identity being discovered was perceived to be minuscule, explained Yaniv Erlich, a computational geneticist at the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, who led the study. However, a 2005 Washington Post article about a teenage boy who tracked down his biological sperm-donor father via online genealogy searches suggested the risk may be significant. According to the article, the boy had submitted a sample of his own DNA to a genealogy service that used repeat sequences from his Y chromosome to search its sequence databases for related males. Although the search did not uncover his father directly, it did find weak matches to two men who, importantly, shared a surname. Along with his father’s place and date of birth (information released to the mother), the likely surname enabled the boy to find and contact his father.

“We heard about this story and we thought, wow, this could be a threat for [the privacy of] personal genomes,” said Erlich.

To see how easy it might be to discover the identity of DNA donors, his team built software for retrieving Y-chromosome repeat information from whole genome sequences. With those repeat sequences, they could perform genealogy searches. “We thought, cool, let’s try it on the genome of Craig Venter,” said Erlich. “And it worked!”

They searched the available genealogical sequence database at Ysearch.org and, sure enough, the strongest match by far was to someone named Venter from Lincolnshire in England. The surname, together with Craig Venter’s known age and state of residence (two pieces of information commonly accompanying anonymous genome sequences), was then used to search the online public records site USsearch.com. The search came up with just two possible people, and one was Craig Venter.

Taking the experiment further, Erlich and his colleagues used their software to retrieve Y chromosome information from the anonymous DNA sequences of male participants in a public sequencing project and showed that, using the same methods, they could accurately determine the identities of multiple individuals. They could even identify anonymous women donors related to the males, by virtue of family tree data accompanying the genome sequences and the ability to search online public records. The important point, said Erlich, is that “everything was publicly available. We didn’t break into any database. We didn’t need any special passwords.”

Although the authors find that the probability of discovering someone’s identity is still low, the study raises the question of whether more should be done to protect donors’ anonymity. But George Church, professor of genetics at Harvard Medical School, who was not involved in the study, thinks there is little point. “You can keep trying to adjust the protocols” (information about participants’ ages might be kept private, for example) “but that’s kind of putting a bandage on it. . . . It’s only going to get easier to re-identify [anonymous sequences], not harder,” he said.

Although the Genetic Information Nondiscrimination Act in the United States prohibits employers and health insurance companies from discriminating on the basis of genetic information, “there is still a fear of the unknown,” said Brad Malin, a professor of biomedical informatics and computer science at Vanderbilt University in Nashville, Tennessee, who is worried that the study will frighten away members of the public from participating in genome sequencing projects. “It is important to highlight these problems, but at the same time, when you highlight them it is very difficult to temper the result,” he said.

Farahany agreed. “What we need to do is better educate people about the facts,” she said. Furthermore, she added, efforts might be better spent on regulating the use of sequence data, rather than ensuring anonymity. “That’s where we should focus our legal and ethical analyses,” she said—“not on trying to prevent the flow of information, but on trying to prevent the misuse of information.”

M. Gymrek et al., “Identifying personal genomes by surname inference,” Science, 339: 321-324, 2013.

Comments

Taxpayer

Posts: 10

January 18, 2013

I hope scientists and the public don't overreact to this. There's a lot of personal and financial information on the internet that can cause greater harm if it is misused. Restricting access can also slow research or mislead its direction if others cannot re-analyze the data easily. We also know that epigenetics and gene modifiers affect phenotype, so just knowing the DNA sequence is not enough. And if data are inaccessible, it is difficult to correct inaccurate information. No one seems to know how much big BAD data are out there.

Dora Smith

Posts: 6

January 18, 2013

I read this in some detail.  The news stories about it are completely misleading.

I initially thought this was something a New York Times article started. The most astounding thing about it is that this is something a group of university geneticists did. A team of them, consisting, to all appearances, of some of the most sanctimonious people on the planet, but led by someone who looks in his university web page photo to be possibly too smart-alecky to take the matter seriously or be expected to make responsible decisions, wrote three, count them, three highly repetitive articles in the current issue of Science, demanding that those of us who have voluntarily made our genetic genealogical data public in order to build knowledge, and geneticists who have made public the data from samples voluntarily contributed by research subjects who wanted to advance knowledge, BOTH stop publicly sharing our data immediately, because we are ignorantly violating our own privacy.

In other words, everyone but the authors of the three papers is so lowly and ignorant that we really can't have thought the matter out for ourselves and be perfectly satisfied with our own decisions. Anyway, we're such low, small, ignorant people that we are desperately in need of intellectually and morally superior (in their own minds) academicians to come in and tell us what to think and what we can do with our own personal information.

The extreme urgency of getting everyone to stop doing genetic genealogy immediately caused Science to make the articles available for free, in the public interest.

Their method is frankly something you'd expect only of TV or newspaper reporters. Frankly, I've never seen anyone else act this way before. I've seen intellectual snobbishness, and sanctimoniousness, but this severely takes the cake. Usually it's reporters who just barge right in wherever they want to go, trample other people's rights, and use the media to sanctimoniously pronounce on how other people should go about their lives, with no respect for anybody.

What the scientists actually did was this: focusing on the set of data that came entirely from people living in Utah, they compared Y DNA STR data from the Thousand Genomes data with Y DNA STR data in the public Y DNA databases, such as YSearch, SMGF, and Ancestry, in order to try to determine what surname the Y DNA data belonged to. The researchers tried to identify matching surnames. The scientists selected two distinctive, unusual surnames that would not be too hard to track down in a finite geographical area like Utah.

It's important to realize that not everyone would know how to find and extract this data from the raw data, or have the time or money to carry it out. For instance, your average crook certainly couldn't do it, and it would be extremely expensive for any corporation that might have nefarious motives to do on a major scale. Those with the resources to do it, like insurance companies, wouldn't do it, because not only is it very expensive, but genetic information isn't actually a reliable way to predict what health problems people will have. Family and personal medical history, weight, and lifestyle factors are far better predictors of future health problems. In fact, nobody at all has ever tried anything like this until this group of people now, and they specifically did it because they are riding a hobby horse.

The papers take up extensive space with why one should expect to identify individual DNA donors by this method - to the exclusion of substantive matters like the ethics of the researchers' own methodology or of wanting to dictate what other people can do with their own personal data.  

I suspect that the researchers must have extracted and compared Y DNA STR data on more than two Thousand Genomes participants, because they couldn't know in advance which samples would turn out to match the Y DNA of people with distinctive surnames that are rare in the phone directories.

THEN, they got on the phone. They called up everyone in Utah with those surnames. Presumably they called them at home while they were eating dinner and doing whatever evening business people do with their families. They asked them if they were the ones who submitted samples to the Thousand Genomes Project. Whether because they didn't see any reason to keep it a secret, or because they felt intimidated, or because they weren't fast enough on their feet to think to tell the callers to jump in the lake, two people admitted to being the ones who submitted the samples. Then the callers lit into them for being irresponsible with their personal genetic information, because they need to become educated and learn to stop violating their own privacy.

It has come up that there is a question about how he actually confirmed he'd found the right people, but he claims he not only identified specific individuals, but identified all of their first-degree relatives as well, starting with USSearch.com lists of people with those surnames who live in Utah. If all he had done was come up with lists of people with the same surname, he wouldn't have found the 80% of the Venters in Utah who have cell or internet phones, and wouldn't have proved the person who submitted the sample to the Thousand Genomes Project was named Venter, and he certainly wouldn't have identified the individual along with every one of said individual's first-degree relatives, which he said he did, for all five people, and not only the Venter, whose identity he actually knew going into it.

The stated purpose of the three highly repetitive papers in Science is to scare the rest of us into becoming educated, and no longer violating our own privacy, by sharing our genetic data on public databases.   

WHAT?!  Excuse me, who is violating whose privacy?  Moreover, if people actually voluntarily violate their own privacy, how is that anyone else's business?  The arrogance of this group of scientists simply passes all belief or understanding.  If they think they're so intellectually and morally superior, their feelings about genetic genealogy, and the history it builds, and their utter contempt for and disrespect of other people, show them to be deeply ignorant people.   

Genetic genealogy, and genetic anthropology, build knowledge about family history, about historical movements of people, and about long-ago cultures, from the material that is in those public databases. People contribute their personal genetic data to them because without that data being there, freely available for anyone to search, they themselves could not find the information and the contributors of the information to even put together their own family history. While Family Tree DNA, a private database, carefully makes it possible to search for only very narrow matches to your own personal data, public databases make it possible to do broader searches in order to find and put together the history of larger related groups of people, such as haplogroup subclades. This makes it possible to trace your own ancestors' history far back in time, and to trace the history of broad groups of people. I am currently working on the history of my brother's newly uncovered small clade, which as a matter of fact emerged directly as the result of the work of the Thousand Genomes Project in identifying new Y DNA SNPs. This broke up haplogroup I1, opened up the history of haplogroup I1, and enabled the members of my brother's cluster to fall out of a sea of very similar haplotypes, most of which haven't been related to him since 5,000 years ago.

People should also realize that all of the public databases make some effort to protect the data against actual abuse, such as data mining. Ancestry will only let you search for matches to whatever data you currently have hand-entered as your own, though you can change it. SMGF will only allow about six searches of its database a day from the same user name or the same computer. Y Search is designed to be used by researchers, among other things, but at every step you have to enter a captcha, which drives everyone bonkers.

Yaniv Erlich said he was quite startled to learn that it was possible to do the project he just did. I have to wonder if he's somehow so spoiled, and so sheltered, in his ivory tower, that he doesn't know what nearly everyone except me who has reacted to his escapade has unconcernedly said all day. It has been true for a long time that anyone who really wants to can very easily learn anything at all about you. It has been true for as long as the Internet has existed, and probably for far longer. The ability to look up people who share a surname in the phone book, identify their ex-spouse, their parents, their siblings, and their children, by name, call them up, and bother them, has nothing whatever to do with their DNA. If you research someone's phone number, you probably won't actually end up with an up-to-date phone number, but you will quickly learn the names, the location, and the ages of their spouse, their former spouses, their parents, their siblings, and all of their grown children, and any housemate they've had for the past ten years or so. If you want to learn more you can readily do so for $15 to $60. People who regularly pry into other people's lives have subscriptions to services that provide the information, from public records. Once, a piece of e-mail I got that had a whole bunch of sensitive personal information, possibly including even part of a credit card number, convinced me my bank account, or a merchant where I do business online, had been hacked. The bank officer changed my debit card, but told me that all of the information they provided me, including my birth date and my social security number, is public information, and very easy for anyone who wants it to get. This is the real world. We all live with it. It's impossible to have common sense and not think there are really much greater privacy-related threats to our welfare out there than letting other people in on our Y DNA.
I guard my social security number, my date of birth, my address, and my phone number. I'm not worried about what my genetic information is going to allow a crook to do to me. Erlich cannot cause me, or anyone else, to start worrying about it by telling us he is superior and we are ignorant. Not that I'm aware of bullying tactics and talking down to people ever having convinced anyone of anything.

One thing that is not as easy to get as Yaniv Erlich makes it out to be is someone's phone number. The majority of people in the United States use cell phones and internet phones that cannot be listed. People also tend to change their phone numbers often. If you look up a phone number online, you may learn a lot of unexpected information about that person and their family, but there's only a small chance that you'll learn the correct phone number. That means that it is not possible to use Yaniv Erlich's method to locate most Americans.

I tried to get my own phone number from USSearch.com, and it isn't there. They also scrambled me together with several other people, though I picked myself out of a list, based on full name, age, and city, and purchased a report specifically on that person. If he went by that, Erlich would have been looking for me in the wrong city, and the family tree he'd have constructed - well, I'd like to have seen him do it.

What is more, in his arrogance, Yaniv Erlich assumes that because we make public information he wishes we didn't, we don't protect information we deem to need protecting. No fewer than three people today suggested to a third person that usually, if somebody whose business it isn't requires your phone number, address, or birth date, they have no way to force you to actually provide the correct information. I want people to be able to see my genetic data and to share theirs with me. I do not want people to come to my house, so I make sure they don't know where it is. In fact, the age USSearch gave me is consistent with the fake date of birth I usually provide online. Yesterday I wrote to someone about their Y Search data, and I got a very detailed and forthcoming answer, from Caspar Ghost.

People have been saying online all day that fusses like this one, and actual cases of abuse of the genealogical databases that create a stir and often change minor details about how the databases are managed, happen from time to time. One person said that Yaniv Erlich never even intended his efforts to be taken as seriously as I take them. Maybe it's me; I'm a highly moral person who only says things when I mean them, and only makes threats when I mean to carry them out. Some think that Yaniv Erlich did and said all of this for no other reason than to impress other academicians. He doesn't actually want to eliminate genetic genealogy. Supporting such an idea, he had his admin reply to me that he doesn't intend to do such a thing. I wrote back, what else is it logically possible he wants? He says that the public nature of the databases threatens our welfare, and genetic genealogy cannot exist without that. He also finds it very important to educate us about the danger. What else can we do about it but stop doing genetic genealogy because it is worthless, dangerous, and wrong? Either at the least he's fine with that outcome, or he has started a war to mess up our lives, just to impress his fellow academics. It must be said that in his online photo, Yaniv Erlich does come across as someone who may be too spoiled and lacking in character, and frankly smart-alecky, for his word to mean anything at all. Most members of the genealogical community, including leadership often inclined to see it more his way, seem to think it doesn't mean a whole lot, and that the appropriate response is to yawn or snort and go on keeping on.

Unfortunately, I think the truth is more likely that he doesn't dare to straightforwardly propose to eliminate the genetic databases at this time. The whole thing looks like a carefully orchestrated, coordinated attack, designed to use the media acting like itself, to build momentum, until there are Congressional hearings and oppressive state regulations to squash genetic genealogy. There are three articles, each authored by a different set of people. The third article states that the authors of the first two are right and the matter is gravely and urgently serious. The editors of Science state that they made the papers available for free on account of the urgent seriousness of the situation. Meanwhile the media is in on it. They are getting quotes from more scientists who were clearly found and referred by the ones who wrote the articles, who reiterate what a serious thing this is and how badly we need to shut down those databases, because people are violating their own privacy by giving out information that could be used to identify them.

What people who take the matter seriously most need to understand is that a fundamental difference in values underlies this controversy. Most of us who contribute DNA samples for complete genome sequencing, and who put our genetic information in public databases, are far more interested in gaining knowledge than we are possessed by obsessive fears about our privacy.

Wanting to collaborate with distant relatives one will never meet in order to know one's history is often impossible to understand for people capable only of sanctimoniousness, excessive nervousness, smug arrogance, and scaring other people about whatever. The same goes for the potential for such research to shed light on history. For instance, my brother's unusual Y DNA cluster may shed light on the movement by Romans of Germanic people from the middle Rhine to Britain; it's already helping to open up the early history of haplogroup I1; and it certainly made it possible to learn that my paternal line came from southeastern Scotland. None of that would have been possible without the public databases: both the data, and the contact information of the owners of the data. And this wouldn't have been possible without that complete anathema of invasion of privacy, the Thousand Genomes Project.

However, the quest for knowledge does not make sense to everyone. There are many people to whom it makes no sense at all. Unfortunately, bullying, arrogance, and ignorance often go hand in hand. Erlich is bullying people with his sanctimoniousness and his telling people they have to be educated by someone who is above them to do things his way. Both the bullying and the attitude toward genetic genealogy betray deep ignorance.

Some people who do genetic genealogy are also other-oriented. It's sometimes called being a Christian.  We share data that can benefit other people because we think that's the right thing to do.   Every scrap of genealogical data I have, including photos of those who are no longer living, is online, so that my younger relatives will never again have the kind of trouble I had learning my family history; even what my grandparents looked like.

It is common for people to fail to realize that individual people do not own family history.  The entire family group owns its history.  The information is public by nature, and it belongs in the public domain, where people can find it.  Our genetic data is part of that history.  

I'm not very worried about genetic discrimination, for several reasons. Employers have very straightforward ways of avoiding having sick employees cost them money. Large numbers of them simply fire you if you get sick. Health and life insurance companies don't spend the money to gather and go through your DNA, partly because your personal and family health history and your lifestyle are far better predictors of your future health. I want an employer I am going to get along with, and work for for a long time. Employers who screen employees on criteria other than their ability to do the job, and especially those that invade people's privacy to do so, are as likely to reject prospective employees who have genealogical web sites betraying them to be "very deep". People doing genetic genealogy have better sense than to worry about it. If Erlich had ever in his life faced real-world problems of finding and keeping a job, he wouldn't have tried that one. One time my cousin got turned down for a job for being a woman, and her father's reaction was, you didn't want to work for them!

People who are worried about keeping their family health history under wraps could do that most effectively by providing complete and truthful information about it to other family members. Everything that runs in my family is online; no young relative of mine will ever again not know until it's too late that bipolar disorder runs in the family. I thought I'd run out of patience then, until at a support group meeting I learned that someone's mother hid her diagnosis of type 2 diabetes from him out of shame, until after he developed the condition, and it was preventable. What's more, evidently this level of silliness is common. Don't, even, start with me about keeping genetic data secret.

How do Yaniv Erlich and company DARE to assume that I've not thought through what personal information I want to make public, that I'm not satisfied with the consequences of my own actions, or that I don't protect personal information I want to keep private?

There seems to be a secondary issue: that people who donated samples to public projects, like the Thousand Genomes Project, could be identified by comparing their data to our data in the public databases. First of all, that isn't my problem. We have the right to do whatever we want with our own personal genetic data, and if someone else is afraid it might be used to identify other people in their own database, they are the ones who will have to adapt. I personally hope they keep the data public, and find research subjects who simply aren't as worried about the matter as they are. They will still meet their quotas of subjects; those full genome projects are always completely overwhelmed with people who want their genomes sequenced without having to pay thousands of dollars. I myself have tried to get in on them repeatedly. Many of us really don't care who has our data and would happily identify ourselves. Why, if the Thousand Genomes Project wants to toss the data of someone who is upset that someone might be able to identify them, and take me instead, I'll be overjoyed. If they took my brother, I'd be even happier. Unfortunately for me, I seriously doubt anybody is going to toss anything.

Frankly, the Thousand Genomes Project has been a complete wonder of public access.  All of us have marvelled at it.  Most DNA research, and most scientific research, is not done like this, and it does not have to be done like this.  Previously one could usually easily get access to data collected by someone else to do one's own research, but one had to go through channels to get access to it. 

Second, I bet most people in the Thousand Genomes Project donated their DNA out of concern for knowledge, not out of concern for anonymity, and just aren't as worried about it as some of the researchers are. The researchers are specifically worried because they were conditioned in school, using Pavlovian tactics, to believe God would strike them with a bolt of lightning if confidentiality were ever breached. It is a GRAVELY SERIOUS MATTER. In other words, this whole thing is about nervous sanctimonious bureaucrats in a tizzy.

So the Thousand Genomes Project can either test people on the terms that they could be identified, or go back to the old way of making their data available.  Neither is hard. 

If I found something of this sort, I'd really have gone to the people in charge of the Thousand Genomes Project, not published an article in Science and tried to shut down genetic genealogy, or "educate" the lowly about our need to not violate our own privacy. The people in charge of the Thousand Genomes Project are as Pavlovianly ethical as any other research scientists, and they're obligated to deal with it. This tactic has, among other things, grossly slowed down that process by sending everyone into a tizzy and forcing them to focus on other things. The choice to do something different says that Yaniv Erlich does NOT want to solve a routine problem; he wants to start something. Hopefully his efforts will backfire on his own career. The course of action he has taken could well cost people on that project their jobs. Pavlovian or not, his colleagues are not likely to take kindly to the irresponsible tactic - unless the point is really something else that has broader support, like ending genetic genealogy, or, more likely, ending citizen science.

The main difference between the way the Thousand Genomes Project is run and the way other big and important datasets are administered is that you and I have access to the data. Scientists employed by universities and research institutes would have access to it under both systems. Genetic genealogy is also citizen science; one of the biggest citizen science projects ever done. Academicians have always been deeply uneasy with citizen science. They want to control science. They want to subject new knowledge to their own process of review. I think most of all they want to preserve their own sense of being special people. This furor is taking the form of an urgent need for people who do genetic genealogy to learn our place, listen to our elders and betters, become educated, and stop being ignorant. I've never actually seen us as a threat to the scientific method, or to the scientific establishment; we depend on Family Tree DNA, whose scientists rigorously apply scientific method, and we feed data to them as much as they provide it to us. But university researchers are often egotistical, and the most frightening, bullying monster you'll ever meet is academicians claiming you're guilty of an ethical violation. Your priorities no longer matter, and quite possibly, neither do the facts.

In fact, it would be very hard for anyone to actually stop genetic genealogical information from being made public in organized ways; it could just be driven from, for instance, Family Tree DNA's servers, and become more scattered. People doing family and relative finder testing have organized their data on large public sites all by themselves, and if they had to, people would pretty much set up Y Search all over again. In fact, Y DNA information is already made public in very widely scattered locations on the Internet; for instance Y Search and the DNA projects that are hosted at Family Tree DNA, the World Families site, and on people's personal web sites. The scattered data is actually collected and researchable as matches to a particular registered haplotype on a Russian web site, which Erlich and his camp followers can't get their sanctimonious hands on, that crawls a number of the scattered data sources. Erlich would know this if he knew much at all about working with Y DNA, instead of sticking his nose here and there to see what might be wrong with the world that he can fix today. I got some of the most important pieces of information about the members of my brother's haplotype from the SEMARGLE site. If not for this service I wouldn't know his group came from the Middle Rhine, because one of the two people with the Y DNA that traces to that location has only made it public on an obscure Jewish DNA project.

I could have found the information in Family Tree DNA's own database, and probably found more information as well, if it allowed people to choose to display matches to our own Y DNA at our choice of a genetic distance, instead of theirs, and if we could see the markers of the haplotypes we match - at least at the option of the person whose data it is. People have to opt to appear in other people's search results, and they should also be able to opt to display their markers. Currently Y Search is the only way to show your markers to someone else, such as a possible match from a genetic genealogy list, and it shouldn't be that way.

As far as trying to identify people in the Thousand Genomes Project, I myself tried to uncover the identities of three people with a newly found unusual Y DNA SNP in the Thousand Genomes database. I found this far more bother than I was prepared to put into it. I specifically wanted to ask the people for more information on where their male lineages had lived, to research the history of the SNP. I also wanted to look into how come 3 of 15 non-Norse haplogroup I1 men in the Thousand Genomes sample were Z131+, while the SNP has proven extremely rare in the real world; much more rare than simply being DF29-, which is unusual. I didn't find that the people in charge of the database were irresponsible, and the confidential parts of their data were not easier to get than they should have been.

I also wanted to check their STR data. Strange, possibly wrongly counted, STR data had been extracted by someone else for one of the three people. I wanted to check it, and then get the same data for the other two people. Since most people who do Y DNA testing don't test SNPs, and no one is going to have tested for a newly discovered SNP, one has to find out what STR markers approximate particular SNPs in order to use the databases to track that group.

Not everyone who might want to do much at all with that data has the skill set. It would have taken me months of hard work just to learn how to extract the STR data I wanted from the complete human genome raw data; presumably an endless stream of four letters. I also found it impossible to get straight answers out of the project staff on what dataset I should order and how to go about ordering it. That could not be more inconsistent with the notion that they aren't protecting the data from people who might abuse it. I outright got the feeling that they found something wrong with me, and, not being the person of complete nerve and endless arrogance that Yaniv Erlich is, this bothered me. To plow right ahead with this project took the amount of nerve and insensitivity I'd expect more of a news reporter.

Maybe I could also have found their Y DNA STR data in public databases, and learned who they were. Perhaps one reason that never occurred to me is that it's unlikely that they'd have unusual enough names to make that possible. Our academicians selected lineages that matched surnames they COULD identify for the sake of this scary demonstration. What is more, in one case they started with a specific individual they could already identify, and worked backwards in order to demonstrate the alleged ease of working forwards - in the authors' imagination.

It's also not likely enough that I'd find matches in the databases to strike me as worth the effort. On the one hand, my three people would be much more likely to be found in the databases if they wanted to be. People who volunteer for genetic research are likely to be more than usually interested in it, and they may also have had Y DNA testing done and put the data in the public databases. People in Utah are often Mormon, often really want people doing genealogical research to find them by any method, and often contribute to public databases such as SMGF in the hope of making genealogical connections.

On the other hand, most Y DNA lineages are not in any database; partly because most of them haven't been Y DNA tested, and partly because many people don't make their data public; it drives the rest of us crazy. What they often do is participate in public databases just long enough to gather everyone else's data, but they don't share their own. Many of them don't know how to add their data to the public databases. Not a few are satisfied to be included in their own testing company's private database. That suits them, but it renders their data unavailable to the rest of us. It is an amazement; my own brother in law's father, "Theophilus McKinstry", would only agree to Y DNA testing on the grounds that no one would ever find him. I knew better than to argue - just made sure that people find me instead, and Theophilus is hardly really his name. Outside of North America few people have done any sort of genetic testing, and two of the three people I wanted to identify were in Britain. The Thousand Genomes Project was worldwide, and Yaniv Erlich restricted his little demonstration to the part of the dataset identified as being all from Utah, because those were the people he could find the resources to possibly identify.

Many people get Y DNA testing done and find they have no genealogically meaningful matches; it's a common complaint on the dna newbies list. Both my brother and my brother in law initially had no genealogically meaningful matches at all, in any database. My brother in law's surname is an example of what Yaniv Erlich would like us to think all surnames are: unusual and distinctive (more or less), with a known origin and a single Y DNA lineage. That was easily confirmed by rounding up more McKinstrys and testing them. There are also distinctive medieval surnames, like Carruthers and Hamilton, with two to five dominant Y DNA lineages together with large numbers of haplotypes that don't match anybody else. Even McKinstrys have had nonpaternity events, and one of them is in the SMGF database. 18th century Scottish bmd records make it clear that we should expect there to be more nonmatches, as at least two McKinstry women in Scotland in that time had children out of wedlock who were given their mothers' surname. By the 18th century, most McKinstrys had gone to Ireland and many were moving on to points west. The Hamilton surname originated five or six times, and many people who adopted the surname lived on the Hamiltons' vast estates in both England and Scotland. One of the large aristocratic Hamilton families had an early nonpaternity event and two main Y DNA lineages. The result is a half dozen major Hamilton haplotypes, and a large multitude of individuals named Hamilton who have their own haplotypes not shared by anyone else, or belong to small family groupings. The Carruthers surname, which originated in a feudal warlord family and the village on his estate on the coast of Dumfriesshire, consists of two Norse I1 Y DNA lineages, and a host of Y DNA lineages shared by only a few people.

With my brother, whose surname is Smith, it took several years to get a single close nonsurname match. The second major break was (choke) the Y DNA SNPs produced by the Thousand Genomes Project. All of haplogroup I are endlessly grateful to it, particularly haplogroup I1. Most of haplogroup I1 have the SNP DF29, and my brother's actual matches do not. Usually one can tell the difference at 67 markers, but not at fewer, and only with Family Tree DNA's choice of markers. Few people test for 67 markers.

At 25 markers, my brother exactly matches the great Anglo-Saxon generic modal haplotype. Yaniv Erlich probably has no idea what that is nor what its implications are for his strategy of how to experimentally invade people's privacy. I suspect that Yaniv Erlich is not familiar enough with genetic genealogy to be familiar with its most common problems. Haplogroup I1 is 25% of the men of northwestern Europe, and most of its members are more or less close to a handful of similar haplotypes. And though my brother and his matches, who don't have the 4500 year old SNP DF29, should have a lot of genetic distance from most of haplogroup I1, who do have DF29 (which we know since the Thousand Genomes Project), clearly some of them, including my brother's cluster, are very close to the Anglo-Saxon generic modal haplotype. I suspect that there are also some that just happen to resemble the Norse I1 modal haplotype. That kind of convergence is usually caused by what you call genetic drift.

At 37 markers, my brother has a genetic distance of five from the Anglo-Saxon generic modal haplotype. He and his close nonsurname match both upgraded to 67 markers before anybody believed this match was for real. It turns out a whole group of identifiably single haplotypes within a genetic distance of ten of my brother are actual matches, who belong to a definite cluster whose members all test negative for DF29, but you have to see 67 markers to be able to know which ones are really a match. The most important problems with that are that most people who test at Family Tree DNA don't test for 67 markers, and that the other testing companies don't test enough markers or the right selection of them.
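For anyone who wants to see why the number of markers matters so much, here is a rough sketch in Python. It assumes the simplest way of counting genetic distance, namely adding up the differences in repeat counts marker by marker; the testing companies use more elaborate rules for some markers, so treat this only as an illustration. The repeat counts are invented, and only the distances of 0 at 25 markers and 5 at 37 markers follow what I described above; the figure at 67 markers is made up.

```python
def genetic_distance(a, b, n_markers):
    """Sum of absolute repeat-count differences over the first n_markers.

    This is a simplified step-wise count, assumed here for illustration.
    """
    return sum(abs(x - y) for x, y in zip(a[:n_markers], b[:n_markers]))


# Hypothetical haplotypes: every value below is invented for illustration.
modal = [14] * 67          # stand-in for a generic modal haplotype
brother = list(modal)      # starts out identical to the modal

# Differences placed only beyond the first 25 markers (0-based positions).
for index, shift in [(27, 1), (30, -2), (33, 2), (45, 1), (60, -1)]:
    brother[index] += shift

# Distance from the modal at each testing level.
distances = {n: genetic_distance(brother, modal, n) for n in (25, 37, 67)}
print(distances)
```

The same two haplotypes look identical at 25 markers and clearly different at 37 and 67, which is the whole problem with trying to pin down a person from a low resolution test.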

Senor Erlich would call Smiths living anywhere who match my brother at 25 and 37 markers up on the phone until the cows came home before he ever successfully found my brother.  He clearly doesn't know as much about genetic genealogy as he wants us to think he does.

McKinstrys also illustrate a number of typical problems. This family group is 600 years old, and belongs to the old I2b1 (M223) Isles Scottish haplotype. They originated in a densely populated small area in southwestern Scotland. Genetic distance between people named McKinstry is as much as 6, consistent with the age of the family group. One problem is that there are other families with surnames that were found in the same area as the McKinstry point of origin, that also have a genetic distance of 6 from members of the McKinstry family group. Another problem is that among those people it isn't as easy as you'd think to discern who is named McKinstry. The actual name was MacAnastrigh (son of a wanderer), and it was corrupted a number of ways, such as Kingstree. One family in the McKinstry DNA project, named, I think, McKeen, are not in fact McKinstry, but it was not possible to discern that without an extended period of observation. They consistently get a large number of matches, which McKinstrys don't, and none of the matches ever match McKinstrys. It's a different family with a different history. There is no way that this project staff could have looked at them one time and correctly judged what to make of them. If McNeights and McArtors with similar haplotypes were related to the project, they might prove to in fact be just a little bit more distantly related to McKinstrys than McKinstrys are to each other. They missed being McKinstrys by only a couple of generations, or maybe they are descended from the Wanderer's other sons, who from the name may not even have been born to the same woman. The main thing is that men not all that distantly related to each other don't all have the same surname. I've found the surname projects full of people with other surnames who were evidently admitted because they might be matches, and from the name and the Y DNA it isn't clear why they are there. There are also older theories of McKinstry family origins. During the 19th century, one McKinstry family adopted a Kinsey family crest.

Our sanctimoniously worried academicians make the matter of identifying someone through these databases look rather more straightforward than it is. If someone on a mission that probably has nothing to do with you or your data ever calls you up on the phone, asking if you're the data daddy, it isn't necessarily true that he can even associate your surname with the data.

When people think of being identified in genetic databases, they tend to think of exact matches like the way O.J. Simpson was identified at trial. An exact match in Y DNA databases would have to come from someone with whom one shares a common ancestor within 100 to 400 years. Further, Ancestry and SMGF do not have the right selection of markers to make an exact match in many cases, and many people who test at Family Tree DNA don't test enough STR markers to know much at all beyond their major haplogroup. To the degree that people can find meaningful exact matches to themselves in the databases that specifically identify those closely related to them, exact matches could be anyone related from 100 to 400 years ago and also wouldn't be a lot of people, so usually people don't find exact matches in the databases. In genetic genealogy, close matches are also meaningful.

But the article doesn't claim that they found exact genetic matches for people; it claims that they found surnames with sufficiently close haplotypes that they felt confident the donor of the sample to the Thousand Genomes Project would have the same surname. We'll get to how likely the underlying assumptions are to be correct. They are more likely to identify the correct family group in the database than to make an exact match, simply because they're very unlikely to make an exact match.

Also, they searched for one of their subjects by working backwards. There was a Mr. Venter, whose contribution to this or another genetic database they had already identified, and they worked backwards to find a match in the public Y DNA databases and "prove" they could have identified him if they hadn't known who he was. Maybe they could have and maybe they couldn't; Y Search has been down and it's not been possible to check.

Since Y Search is down I can't check to see what would be there that would match Venter, or how. However, I can tell you what would happen if you'd tried to match my brother's Smith line in Y Search. Just so he can check, my brother's Y Search ID is GH4TY. My brother and his Y DNA cluster are a very good example, because every one of Yaniv Erlich's assumptions that he put all that mathematics into "proving" falls apart.

If you went looking for my brother, and used his Y DNA STR's to try to identify him, here is what you would run into.   

At Y Search, you'd find one Smith, and one Gower, with a genetic distance of 2 at 67 markers. If all you had to go on was his Y DNA, so far you wouldn't know if you were looking for a Smith or a Gower.  You'd also learn my emigrant Smith settled in Chester County, Pennsylvania.  Suppose you settled on the name Smith.  Start phoning people in the phone book in Chester County, Pennsylvania, trying to find the Smith you think submitted a sample to the Thousand Genomes Project, now.  The Gower is from Nashville, which has many Gowers, who all belong to a single male lineage - and they may or may not all belong to Jim Gower's Y DNA lineage. The Gowers' paper trail leads to Worcestershire, England, while my Smith line were Presbyterian weavers from Ireland who landed at Newark, Delaware in the 1790's.  No other member of this large old Gower family group has ever been Y DNA tested, and I couldn't talk any of them into it.  Importantly, the Smiths appeared to be Scotch Irish.  

Look now at all the 5 and 6 off matches at 67 markers.   That is three surnames that belong to two family groups.  One of them is the son of a Murray who wasn't married to his mother, so his name isn't Murray.   In addition to belonging to the same cluster and being fairly close in genetic distance, both families share with my brother and Jim Gower one STR marker value that is unique in non-Norse haplogroup I1.   They also share the unusual lack of the basal SNP DF29, that most haplogroup I1 people have.   One of these families is Scotch-Irish, and the other lived in a village on the north bank of the Firth of Forth in 1600.  This tells us that my brother's lineage came from Scotland.  This means that Jim Gower has Smith line Y DNA - if it is even Smith, because no other Smith matches it.    You all dig where this is going now?   It gets better.

Look now at the 9 and 10 off matches. In Y Search, you have to compare on 37 markers and look for three and four off matches to see them. The unique marker value of the Scottish families is only found in Scotland; however, several marker values only visible at Family Tree DNA at 67 markers, as well as DF29 testing, tell us which ones belong to my brother's cluster. Two of them are in or trace to the middle Rhine, which may not mean they originated there in the 3rd century, but it probably does. One group of families stands out; it's the group of eight surnames with one haplotype, seven of which have a genetic distance from each other of 0 at 67 markers. Chuckle. (All but the one who can be traced to Worcestershire, England, which tells us that is the general area they came from, lived in one county in Maryland in the 18th century, and two of them were inlaws.) OK, now, WHICH of the eight families would you start calling up on the phone to see if they donated a sample to the Thousand Genomes Project?
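To put the surname problem in concrete terms, here is a rough Python sketch. The list of matches is a made-up stand-in for a database search result: Smith and Gower are the names I mentioned above, every other surname is a placeholder, and all of the distances are illustrative rather than copied from Y Search. The function name is my own invention, not part of any real tool.

```python
def candidate_surnames(matches, max_distance):
    """Return the distinct surnames among matches within max_distance."""
    return sorted({name for name, distance in matches if distance <= max_distance})


# Hypothetical search results as (surname, genetic distance at 67 markers).
matches = (
    [("Smith", 2), ("Gower", 2)]                       # the closest entries
    + [(f"Family{i}", 5) for i in range(1, 4)]         # three placeholder surnames
    + [(f"Surname{i}", 9) for i in range(1, 9)]        # eight placeholder surnames
)

# How many surnames would you have to chase at each cutoff?
for cutoff in (2, 6, 10):
    names = candidate_surnames(matches, cutoff)
    print(cutoff, len(names), names)
```

Even at the tightest cutoff there are two surnames to choose between, and loosening the cutoff only makes the list of people to phone longer.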

So much for using Y Search to identify someone so he can call everyone with their surname up on the phone and ask if it's them.  

No matter what scary things might be possible in this world, ONLY Yaniv Erlich has ever actually used two sets of databases, and then aggressively called people at home on the phone, to identify participants in a research study. I think the only thing his project speaks to is the moral character of Yaniv Erlich.
