FLICKR, KEVIN MACKENZIE, UNIVERSITY OF ABERDEEN
Researchers at King’s College London were working on some human gene expression experiments in 2008 when they got a strong match to one of the probe sequences in an Affymetrix microarray. The only available information on the gene from the chip was that it was a human sequence, recalled William Langdon, who helped on the project. So the team did a BLAST search to look up more information. “And the first thing you get back is, of course, the human sequence itself,” said Langdon, who is now at University College London. But when he scanned down the list of the other related sequences that appeared in the search, it was apparent something was amiss. “They [were] all different species of Mycoplasma.”
It appeared to be a case of mistaken identity; the original submitter of the sequence to GenBank must have had Mycoplasma contamination in a human sample, and assumed the sequence was human. In a study Langdon and colleagues published in 2009, the authors showed the striking resemblance between this “human” sequence and a particular marker sequence from various Mycoplasma species.
To this day, the sequence is still labeled “Homo sapiens unknown” in GenBank, the National Center for Biotechnology Information (NCBI) database. This misnomer represents one of the hundreds, perhaps thousands, of sequences deposited in GenBank and elsewhere that have been assigned to the wrong taxon.
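The check Langdon describes, comparing a record's taxon label with the taxa of its closest database hits, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the thresholds, and the toy hit list are hypothetical, and the genus names are typed in by hand rather than parsed from real BLAST output.

```python
from collections import Counter

def flag_suspect_label(label_genus, hit_genera, min_hits=3, min_fraction=0.8):
    """Return the dominant genus among the hits if it contradicts the label, else None.

    label_genus: genus the record claims to come from (hypothetical input).
    hit_genera:  genus of each top similarity-search hit, including the record itself.
    The thresholds are arbitrary illustration values, not published settings.
    """
    # Hits that carry a different genus than the record's own label.
    others = [g for g in hit_genera if g != label_genus]
    if len(others) < min_hits:
        return None  # too little conflicting evidence to say anything
    genus, count = Counter(others).most_common(1)[0]
    # Flag only when a single other genus accounts for most of the hits.
    if count / len(hit_genera) >= min_fraction:
        return genus
    return None

# Toy data mirroring the anecdote: one "human" record, the rest Mycoplasma.
hits = ["Homo"] + ["Mycoplasma"] * 5
suspect = flag_suspect_label("Homo", hits)
print(suspect)
```

A result other than `None` is a prompt for a person to look closer, not a verdict that the record is mislabeled.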
That errors exist in GenBank and other databases is a truism. But correcting mislabeled sequences is a difficult task, one that database stewards and computer scientists are now trying to automate. “I have a vision here that over the next few years we have a variety of computational approaches . . . to create curated subsets across all of GenBank,” said David Lipman, the director of the NCBI.
There are a number of reasons a researcher might assign a sequence to the wrong organism, including microbial contamination in samples, chimerism (when the genomes of two organisms combine during the DNA amplification process), poor taxonomic identification, or even simple human mix-ups during sample preparation.
The extent of the mislabeled sequence problem remains a matter of speculation, but a few studies have lent some insight. Earlier this year, for instance, Langdon searched a subset of data from the 1000 Genomes Project for possible contamination. “About 7 percent of samples have Mycoplasma contamination,” he said.
Another study this year found Bradyrhizobium to be a common contaminant in eukaryotic sequences. For instance, sequences assigned to taxa as diverse as a Tibetan antelope, a fungus, a protozoan, and Homo sapiens are all Bradyrhizobium. “The problem is much more extensive,” Martin Laurence, the founder of ShipShaw labs who led the study, told The Scientist in an e-mail. “I have a long, unpublished list of contaminated sequences, since the DNA extraction kits I use are also contaminated, so I end up seeing a zoo of animals in my human clinical species (parrot sequences are particularly popular),” he continued. “Obviously, there were no parrots or Tibetan antelopes anywhere near my samples.”
Evolutionary biologist Stephen Smith at the University of Michigan builds large phylogenetic trees of plants. In one project, on a group of plants including cacti and carnivorous species, Smith analyzed about 4,000 organisms that had enough overlapping sequences in GenBank to make a tree. “Something on the order of 1 to 2 percent of what I used to build this tree is mislabeled,” he said. “It’s not a big number, but if you care where species fall within the phylogeny, it does make it a big deal.”
Even when a sequence is clearly mislabeled in GenBank, only the person who submitted the errant entry can correct it. There are procedures to alert the database administrators to problems, but it’s a laborious task for them to contact the submitters and investigate each case. Mislabeled submissions are sometimes corrected, but often they remain in the database.
Alexis Stamatakis, a bioinformatician at the Heidelberg Institute for Theoretical Studies in Germany, is used to complaints from his biologist colleagues about mislabeled sequences. A few years ago, he decided to do something about the issue. He and his group members have developed an algorithm to root out mislabeled sequences. “Right now the method is not fully automatic,” he said. “We have a half-automatic method to facilitate the curation process that will then provide a list of putative mislabeled sequences to the curator.” It is the user’s job to decide whether the sequence does in fact belong to a different organism.
The developers have not yet published their algorithm, but Pelin Yilmaz, a postdoc at the Max Planck Institute for Marine Microbiology in Bremen, Germany, has taken it for a test drive. She is a member of the team behind the SILVA database, a curated collection of ribosomal RNA sequence data. Every month she gets a handful of questions from users asking about potentially mislabeled sequences. She applied Stamatakis’s software to a group of organisms consisting of only cyanobacteria. Using taxonomy from GenBank, “out of 1,000 [sequences] I found 150 mislabeled, which is not that bad,” she said. Two other datasets, Greengenes and the Ribosomal Database Project, each showed up with 90 potentially mislabeled sequences, while the SILVA taxonomy had 30.
“It would have been really hard to find mislabels like this,” Yilmaz said. “If I had to do it manually I suppose I would have to build phylogenetic trees over and over again. This is much better.”
The algorithm’s success starts to break down at the species level, but at the genus level it’s quite accurate, identifying mislabeled sequences with up to 98 percent precision, said Alexey Kozlov, a graduate student in Stamatakis’s lab. At present, the program can handle about 10,000 sequences, so it’s best applied to smaller datasets. Kozlov said scaling up the number of sequences is a future goal.
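For readers unfamiliar with the term, precision is conventionally the fraction of flagged items that really are errors. The article does not say how the group measured it, so the sketch below assumes that standard definition; the function name and the sequence IDs are invented for illustration.

```python
def precision(flagged, truly_mislabeled):
    """Fraction of flagged sequences that are genuinely mislabeled.

    Assumes the standard definition: true positives / all flagged.
    Returns 0.0 when nothing was flagged, to avoid dividing by zero.
    """
    flagged = set(flagged)
    if not flagged:
        return 0.0
    true_positives = flagged & set(truly_mislabeled)
    return len(true_positives) / len(flagged)

# Toy example: 50 hypothetical sequence IDs flagged, 49 confirmed by a curator.
flagged_ids = [f"seq{i}" for i in range(50)]
confirmed_ids = [f"seq{i}" for i in range(49)]
score = precision(flagged_ids, confirmed_ids)
print(f"{score:.2f}")
```

Precision says nothing about how many mislabeled sequences go unflagged, which is one reason a curator still reviews the results.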
Meanwhile, NCBI is making some efforts to clean up misidentified sequences in GenBank. The agency has been working internally and with outside groups to develop a curated set of 16S sequences linked to type strains and of internal transcribed spacer (ITS) sequences—another widely used marker—in fungi. “Those are particularly important sequences to curate and get cleaned-up sets because they’re used by so many to classify their organisms,” said Lipman.
Lipman said he’s pleased to learn of developers like Stamatakis who are working to automate the process of scrubbing genetic databases. He’d like to see such tools applied across GenBank, particularly at the point of submission. “So largely it means rather than the database looking at each record as it comes in at the back end, then having to get back to the submitter, if we get these consensus models ahead of time . . . ultimately, you can see how this would save us a lot of time.”
It’s especially important for GenBank to prioritize such efforts given how researchers now use the database, he added. “It has to do with this transition that sequencing is now done for comparative purposes, therefore, we should be doing a good job to clean it up and so we can very rapidly give a much more informative response to a user.”