ILLUSTRATIONS BY JACKIE FERRENTINO
When Supratim Mukherjee noticed the same bacteriophage sequence popping up again and again in hundreds of microbial genomes from a database he was analyzing, he got excited. Mukherjee, a bioinformatician at Lawrence Berkeley National Laboratory, was comparing the metabolic pathways of the microbes, and he began to wonder what the nearly ubiquitous sequence was. “I thought we must have discovered something novel,” he recalls. “This entire bacteriophage genome was intact in all these diverse microbes.”
But when Mukherjee looked up the bacteriophage sequence, he learned that it was the sequence of PhiX, a bacteriophage sold by Illumina as part of the company’s sequencing kits to run alongside a genome of interest. Ironically, PhiX is intended as a quality-control measure, to track error rates on any given sequencing run. But in many hundreds of cases, Mukherjee found, researchers had failed to remove the PhiX sequences from...
Mukherjee’s experience isn’t an isolated one. A recent slew of reports has shown that the contamination of published genomes is more widespread than ever. (See “The Great Big Clean-Up,” The Scientist, September 2015.) How does this happen? And what can you do to ensure that your sequences don’t become part of the problem?
The Scientist asked a handful of researchers to share their experiences with contamination and their best tips to detect or prevent these rogue sequences. Here’s what they said.
Widespread genome contamination
When Mukherjee’s team realized that PhiX contamination was rife in published microbial genomes, the group decided to quantify just how often this happened. Among the 18,000 bacterial and archaeal genomes published in the public Integrated Microbial Genomes database, more than 1,000 are contaminated with PhiX sequences, the researchers reported earlier this year (Standards in Genomic Sciences, 10:18, 2015). Around 10 percent of the contaminated sequences had also been published in peer-reviewed publications.
PhiX contamination is just the tip of the iceberg—and the problem is growing exponentially, says David Lipman, director of the National Center for Biotechnology Information (NCBI), which has been screening sequences submitted to its GenBank database for the past five years. “In 2012, we were detecting contamination in 2 to 3 percent of bacteria and archaea submissions,” Lipman says. “But then it started climbing steadily, and by 2014 it was close to 10 percent. This year, so far, it’s on the order of 23 percent.” The rise, Lipman hypothesizes, is tied to the increase in the number of labs—including many that don’t specialize in genomics—that are now sequencing genomes.
Microbial sequences aren’t the only ones riddled with contamination. Last year, computer scientist William Langdon of University College London discovered that at least 7 percent of the human genomes included in the 1000 Genomes Project were contaminated with mycoplasma genetic material (BioData Mining, 7:3, 2014). (See “Out, Damned Mycoplasma,” The Scientist, December 2013.) So it’s safe to say that if you’re struggling with a contaminated genome, you’re not alone.
Sources of contamination
There are a few ways that contamination happens, says bioinformatician Rob Edwards of San Diego State University. “The first is that someone in the lab can confuse two samples and accidentally mislabel a specimen or a file,” he says. “That’s something you can easily mitigate by having a good lab management system and improving your record keeping.” (See “Lab 2.0,” The Scientist, December 2012.)
Alternatively, contamination can stem from extraneous genetic material that’s been introduced into a sample. Or, if you’re culturing bacteria you collected from the environment, Edwards says, it’s not uncommon for multiple species to appear in a sequencing run, even if you think you only streaked a single culture. Likewise, if you’re sequencing microbes from the human gut, your sample will naturally have human cells. Even the genes in a mitochondrion or chloroplast could be considered contamination if you only want an organism’s nuclear genes. These contaminants can’t be completely avoided, but there are measures that can be taken both to clean up your sample before you start sequencing and to purge your sequencing results of any residual contamination.
Edwards, whose team focuses on metagenome sequencing from environmental samples, says his group often uses size filtering to separate organisms when starting with an especially mixed bag of viruses and bacteria, such as a test tube of seawater. Or, if the researchers suspect contamination with human DNA, they hybridize their samples with human genetic material to pull out the human DNA, leaving only the microbial sequences.
Similar clean-up approaches can be used when you’re dealing with contamination that comes with the protocols you use, such as sequences from PhiX controls, from primers and adapters added to samples to amplify and sequence genes of interest, and from cloning vectors—the genetic vehicles that allow foreign DNA to be copied by host cells.
Yet another source of contamination is carryover in dirty sequencing machines, in which genes from the prior run bleed through and appear in the next one. Just being aware that this kind of contamination might exist in your experiments can help you choose methods to remove it after sequencing, Edwards says. Or, if it shows up repeatedly, you can try changing protocols or troubleshooting your machine.
There’s no question that getting rid of contaminants as early as possible in your process is ideal—for all sorts of reasons. “Contaminants are bad because they increase the direct costs,” says Dominik Laetsch of the University of Edinburgh. “You literally get fewer nucleotides per dollar” when you spend time processing and analyzing unwanted sequences. But here’s the good news: even if your sequences are filled with PhiX, primers, vectors, and genes from species you didn’t intend, you can get rid of all the evidence before anyone else sees your final genomes.
Laetsch is among those developing tools to help clean up sequence data before analysis. Blobtools-light, the newest version of his software, compares your DNA contigs—the overlapping bits of sequenced DNA that are eventually assembled into a final sequence—with known sequences from NCBI’s databases. Then, it provides an easy-to-interpret visualization of this comparison, lumping together sequences thought to be from similar organisms. “We use this as an initial screening tool,” says Laetsch, who studies the bacteria that inhabit pathogenic nematodes, and so is frequently faced with data sets from multiple species. “We can see right away when there are low coverage contigs that aren’t needed.”
A similar program, called ProDeGe (Protocol for fully automated Decontamination of Genomes), was also described this year (ISME, doi:10.1038/ismej.2015.100, 2015). Like Blobtools-light, the protocol uses public databases to detect contamination in a genome assembly, then sorts contigs into “Clean” or “Contaminant” groups. While Blobtools-light provides a visual chart of sequence groups, ProDeGe spits out lists that users can read through to identify contaminants and determine what they might be. “You don’t have to know a lot about it to use it,” says Mukherjee, who has used the ProDeGe software. “So for biologists who are scared of these kinds of tools because they don’t know how to do bioinformatics, it’s a great solution.”
More-specific tools also exist. NCBI, for instance, offers VecScreen, which quickly identifies contaminating vectors in your sequence. And tools that are even more advanced will be available on the NCBI site later this year, Lipman says.
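To make the idea of screening for one known contaminant concrete, here is a minimal Python sketch that scores how much of a contig is shared with a reference sequence such as a vector or the PhiX control. It is an illustration only, not how VecScreen or any tool named here works: the function names, the k-mer approach, and the default k of 21 are all assumptions chosen for simplicity.

```python
# Hypothetical helper names; a toy stand-in for alignment-based screening.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(contig, reference, k=21):
    """Fraction of the contig's distinct k-mers found in the reference.

    Both strands of the reference are checked, because a contaminant
    can be assembled in either orientation.
    """
    contig_kmers = kmers(contig, k)
    if not contig_kmers:
        return 0.0  # contig shorter than k: nothing to compare
    ref_kmers = kmers(reference, k) | kmers(reverse_complement(reference), k)
    return len(contig_kmers & ref_kmers) / len(contig_kmers)
```

A contig that scores close to 1.0 against the control sequence is a strong candidate for removal; where to set the cutoff is a judgment call that depends on your data.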
All the tools available to detect contamination must balance specificity and sensitivity—identifying sequences that are definitely contaminants without removing sequences that are part of the target genome—so it’s important to understand the results in the context of your data set, says Edwards. If you’re analyzing a genome from a novel group of species, for instance, programs may flag high levels of contamination simply because existing databases don’t already contain homologues for the organism’s genes—so take the results with a grain of salt.
Or, if you’re expecting to see high levels of bacterial genome contamination, you can immediately home in on those sequences among the full list of contaminants. Because you know more about the source of your initial sequence and how it has been handled than any automated program, you can make these kinds of judgment calls, Edwards says. “I definitely recommend running your assembly through several different tools and comparing the results.”
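The two ideas above, balancing sensitivity against specificity and comparing the output of several tools, can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions: each tool is assumed to give you a list of flagged contig names, and the "true" contaminant set is assumed to come from a test assembly you spiked yourself. The function names and the tool labels in the usage note are hypothetical.

```python
from collections import Counter


def sensitivity_specificity(flagged, true_contaminants, all_contigs):
    """Score one tool's calls against a known (e.g. spiked-in) truth set."""
    flagged = set(flagged)
    true_contaminants = set(true_contaminants)
    clean = set(all_contigs) - true_contaminants
    tp = len(flagged & true_contaminants)   # contaminants correctly flagged
    fn = len(true_contaminants - flagged)   # contaminants that slipped through
    fp = len(flagged & clean)               # target contigs wrongly flagged
    tn = len(clean - flagged)               # target contigs correctly kept
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def compare_tool_calls(calls_by_tool):
    """Split flagged contigs into those every tool agrees on and the rest."""
    counts = Counter()
    for flagged in calls_by_tool.values():
        counts.update(set(flagged))
    n_tools = len(calls_by_tool)
    consensus = sorted(c for c, n in counts.items() if n == n_tools)
    disputed = sorted(c for c, n in counts.items() if n < n_tools)
    return consensus, disputed
```

For example, calling compare_tool_calls({"tool_a": ["c1", "c2"], "tool_b": ["c2", "c3"]}) separates the contig both tools flag from the two that only one of them flags. The disputed contigs are where your own knowledge of the sample matters most.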
Once you’ve pinpointed the contamination—and, hopefully, where it came from—you can move toward cleaning up your data. There are tools to do that, too, including DeconSeq, developed by Edwards’s group. Unlike the more-automated contamination screen programs, DeconSeq requires the user to input contaminant species. Then it automatically removes sequences belonging to those species from the genome assembly.
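As a rough picture of that final removal step, the Python sketch below drops records from a FASTA-formatted assembly once you have a list of contig names judged to be contaminants. It is not DeconSeq's method, which works from the contaminant species you supply; here the list of names is assumed to come from whatever screen you ran, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_contaminants(records, contaminant_ids):
    """Keep only records whose ID (first word of the header) is not flagged."""
    contaminant_ids = set(contaminant_ids)
    kept = []
    for header, seq in records:
        words = header.split()
        record_id = words[0] if words else ""
        if record_id not in contaminant_ids:
            kept.append((header, seq))
    return kept
```

Keeping the removed records in a separate file, rather than discarding them, makes it easier to revisit the decision if a flagged contig later turns out to belong to the target genome.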
If you skip the step of contamination removal before you finalize a sequence, someone else might catch it. At NCBI, for instance, Lipman’s group runs a foreign contamination screen on every sequence submitted to GenBank. He hopes that when the screen flags a sequence as contaminated, scientists use it as an opportunity to learn more about their data set and its weaknesses, and change their methods to avoid the problem in the future. “If you just say, ‘OK, I had a little problem in my submission and now I fixed it,’ then this problem is going to keep happening,” Lipman says.
If a genome is already published in the literature or a database by the time you detect contamination—when you’re running more experiments, for instance—then own up and fix the mistake as soon as possible, before it ricochets into the work of other researchers who might be relying on your data for their own experiments. In some cases, this might mean contacting a journal to find out whether a correction or a full retraction is warranted.
“People need to take ownership of their data,” says Mukherjee. “If you find a problem, retract it, clean it up, and then give it back.”
Will the contamination problem get better?
It’s tempting to speculate that as sequencing technologies advance, many of the sources of contamination will simply disappear. There’s some truth to this, says Laetsch. “As the assembly process gets easier, and read lengths start to increase, it will definitely become even easier to pinpoint contamination,” he says. But researchers shouldn’t take that as a signal to stop screening for it, he adds. “Sequencing machines are only as good as the samples you put in them.”
And as databases of genomes become larger, it becomes increasingly difficult to go back and sort through them all to ensure clean sequences. It’s up to individual researchers to do their best with each genome, says Mukherjee. “I think the scientific community in general acknowledges that contamination is a big problem, but there has to be a bigger concerted effort to do something about it.”
Even as contamination rates in GenBank skyrocket, Lipman agrees that there is growing awareness of the problem. Part of what’s leading to the increasing contamination, he points out, “is the positive story that more and more labs are able to do sequencing, which is a wonderful thing.”