The Great Big Clean-Up

From tossing out cross-contaminated cell lines to flagging genomic misnomers, a push is on to tidy up biomedical research.

Sep 1, 2015
Kerry Grens


Several years ago, a manuscript characterizing a cell line called RGC-5, which was derived from rat retina, came across the desk of Thomas Yorio, then an associate editor at Investigative Ophthalmology & Visual Science (IOVS). The line was commonly used in vision research; Yorio had used it in his own work at the University of North Texas Health Science Center, and researchers across the field had by then published more than 200 studies involving the cells. But the authors of the new paper had found that RGC-5 cells were not retinal ganglion cells after all. RGC-5 cells hadn’t even come from a rat.1 Suddenly, all of those published studies were called into question.

“They were the first to bring it to my attention,” says Yorio, now editor in chief of IOVS. “That got me to say . . . ‘If this is true, we have to identify [this cell line] and how the heck it’s mouse and not rat.’ ”

A University of North Texas lab first derived the RGC-5 cell line in 2001. By the time Yorio read the manuscript questioning the line, the principal investigator of the originating lab had left for a position at another institution, so Yorio took it upon himself to investigate. He assembled a forensic team to gather notebooks and frozen, early-passage samples of RGC-5 cells and sent the cells to various independent groups for analysis. The results confirmed that RGC-5 was not what everyone thought it was. Rather, it appeared to be a mouse cone photoreceptor line.2 Beyond the species mismatch, the functional differences between the two cell types are vast: photoreceptors detect light, while ganglion cells relay that information to the brain via their axons, which make up the optic nerve.

“In our little community, the RGC-5 story was extremely embarrassing to everyone,” says John Nickerson, an Emory University retina researcher and an editor in chief of Molecular Vision, a journal that had published 18 papers using RGC-5 as an in vitro model. Because so many researchers had used the cell line in their own work, “everyone got burned a little bit,” he adds.

If we’re not using what we think we’re using, we’re not testing our hypotheses. We’re just gumming up the literature. I’m not sure what we’re doing, but that’s not science.

—Jeffrey Boatright, Emory University

To deal with the problem, the vision field did something unorthodox. The editors of IOVS, Molecular Vision, and another leading vision journal, Experimental Eye Research, joined forces to blackball RGC-5. In August 2013, the three journals issued new editorial policies that essentially banned publications that used the cell line.

“We declared from that point forward that papers stating they were using RGC-5 would not be accepted,” says Emory University vision researcher Jeffrey Boatright, another editor in chief at Molecular Vision. His journal and IOVS also went further, requiring authors to validate the authenticity of all cell lines used in their work, rather than rely on information from suppliers or on prior characterizations.

Although cell line misidentification has plagued science for more than half a century (see “Seeded by Weeds,” The Scientist, May 2015), the journals’ stance was progressive, reflecting a growing appreciation for the magnitude of the problem and its contribution to irreproducible research. In a recent analysis of preclinical science, Leonard Freedman, president of the Global Biological Standards Institute (GBSI), and a pair of economists estimated that about half of all studies cannot be replicated because of flaws in design, procedure, analysis, or materials—to the tune of about $28 billion.3 And more than a third of that irreproducibility could be attributed to “contaminated, mishandled, or mislabeled biological reagents like antibodies or cell lines,” they wrote in their report.

The vision field isn’t the only scientific discipline making efforts to clean up its mess. From authenticating cell lines to identifying mislabeled genomic sequences and developing purer reagents, researchers across the life sciences are engaging in the fight against irreproducible research caused by contaminants. “There are more people involved trying to improve the situation, at all levels: scientists at the bench who are more aware, funders, people trying to do advocacy or build data sets to work out how significant the problem is,” says Amanda Capes-Davis, chair of the International Cell Line Authentication Committee (ICLAC). “I think things are improving.”

But addressing contamination in life science research is a huge undertaking, one that requires changing culture, attitudes, and old habits—a goal perhaps easier said than done. “To this day,” says Boatright, “there are labs still trying to get work published on the assumption [their cells] are RGC-5.”

The extent of the mess

BACTERIAL STOWAWAYS: Mycoplasma contaminate up to one-third of cell cultures, and sequences from the bacteria are listed in some genomic databases as human genes. © DON W. FAWCETT/SCIENCE SOURCE

Far and away, cell line misidentification has received the lion’s share of attention when it comes to the problem of contamination in life-science research. And for good reason: although misidentification doesn’t always invalidate results, it’s misleading at best, mucking up one-quarter to one-third of research that uses cell lines, according to one estimate.4

But cell line mix-ups are just one cause of science’s irreproducibility problems. Another scourge of cell culture is Mycoplasma, bacteria that taint up to one-third of cultures. (See “Out, Damned Mycoplasma,” The Scientist, December 2013.) Mycoplasma contamination is so pervasive that the bacteria’s genes “have managed to jump the silicon barrier and get themselves incorporated into international data banks as human genes,” University College London’s William Langdon wrote in a 2014 study of the 1000 Genomes Project database. Seven percent of the sequence samples, he found, were from Mycoplasma species rather than Homo sapiens.5

And sequence contamination is hardly limited to Mycoplasma. One recent report found that a variety of sequences assigned to particular species—from human to fungus to Tibetan antelope—in fact belonged to Bradyrhizobium bacteria.6 Another study found widespread contamination from human DNA or human-dwelling bacteria.7 (See “Fact or Artifact?,” The Scientist, October 29, 2014.)

Genetic contaminants may be introduced during DNA preparation and sequencing, but some studies have found reagents themselves to be a source of contamination. In a clever bit of detective work in 2013, researchers figured out that what had been reported to be a new virus infecting hepatitis patients was actually viral DNA present in spin columns used to extract nucleic acids, likely originating from the diatoms used to make the silica tubes.8 And in a recent study from Alan Walker’s group at the University of Aberdeen, researchers found that genetic material present in popular DNA extraction kits can swamp out the signal of a sample if starting concentrations of the target sequence are low enough.9

Nikos Kyrpides, head of the Joint Genome Institute’s prokaryote super program, which oversees microbial genomics and metagenomics projects, says next-generation sequencing has been a game changer in understanding contamination. “Quite likely there was always contamination with reagents,” he says. “It’s just with Illumina sequencing we can see it.”

On the other hand, next-generation sequencing has also introduced a lot of low-quality draft sequences that open up the potential for labeling errors, says David Ussery, the comparative genomics group leader in the Bioscience Division of Oak Ridge National Laboratory. And the control samples often used to weed out other types of sequencing errors can themselves contribute to the problem. Earlier this year, Kyrpides and colleagues scanned 18,000 microbial genome sequences stored in the Integrated Microbial Genomes database and found more than 1,000 of them contained the sequence for PhiX, a bacteriophage genome used as a control in Illumina sequencing.10

“Basically, everybody knows that they should be removing those [sequences]. But there are cases where some groups are forgetting this and leaving PhiX in the final data and submitting them to GenBank,” Kyrpides says.
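The screening step Kyrpides describes can be reduced to a toy k-mer check: flag any assembled contig that shares sequence with the control reference. Note that the sequence below is a synthetic stand-in, not the real PhiX174 genome, and the k-mer size is an arbitrary illustrative choice; a production pipeline would use the full reference and an established screening tool.

```python
# Sketch: flag assembled contigs that share k-mers with a control/reference
# sequence. PHIX_STUB is a synthetic stand-in, NOT the real PhiX174 genome.

def kmers(seq, k=15):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

PHIX_STUB = "ATGCGTAACGTTAGCCTAGGATCCGATTACGGCATTAAC"  # placeholder sequence
PHIX_KMERS = kmers(PHIX_STUB)

def looks_like_phix(contig, k=15):
    """True if the contig shares any k-mer with the control reference."""
    return any(contig[i:i + k] in PHIX_KMERS
               for i in range(len(contig) - k + 1))

# A contig carrying a slice of the control sequence is flagged; a clean one is not.
dirty = "TTTT" + PHIX_STUB[5:30] + "GGGG"
clean = "C" * 40
print(looks_like_phix(dirty), looks_like_phix(clean))  # True False
```

In practice the flagged contigs would simply be dropped before submission to GenBank, which is the cleanup step Kyrpides says some groups forget.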

The janitors

The consequences of contamination are far-reaching. While the amount of financial waste remains ill-defined and contentious, everyone agrees that contamination muddies the literature, contributes to the problem of irreproducibility, and can take a personal and professional toll on those who have been affected.

There are a lot of cells being called every which thing.

—Howard Soule, Prostate Cancer Foundation

Years ago, as Capes-Davis was wrapping up her PhD research on developing a new mouse cell line as a thyroid cancer model, her group at the University of Sydney decided to authenticate their cell lines and found evidence of contamination in their samples. “It was at the point where [my] work was completed and already written up,” she recalls. “I thought, ‘I can’t even face the prospect of testing,’ and I just walked away. And all of that material wasn’t used. That was a pretty confronting experience for me as a young scientist.”

She went on to help establish a cell bank in Australia, and came to recognize how widespread cell line contamination really is. To combat the problem, she decided to develop a reference guide to known cross-contaminated cell lines and soon came across the University of Glasgow’s Ian Freshney, who was building a similar list. They joined forces, and in 2010 launched a catalog of 355 cell lines reported, either in the literature or by cell banks, to be cross-contaminated or otherwise misidentified. Currently, the list includes 475 lines, some of which, Capes-Davis has found, are still in use by researchers under their mistaken identity. ICLAC, which Capes-Davis chairs, now curates the list.

Capes-Davis also got involved in developing cell-line validation standards, released by the ATCC Standards Development Organization in 2012. The guidelines recommend, first, consulting the contaminated-cell-line list, and second, sending out samples for short tandem repeat (STR) profiling at the beginning and end of any experiment to confirm—or refute—the cells’ identity.

For years, STR profiling has been the go-to method for validation. It counts the number of times a brief stretch of DNA is repeated at each of several loci, generating a profile for that cell line. But this year, Richard Neve of Genentech and his colleagues developed a new resource for validation. The researchers built a list of more than 2,700 STR cell-line identifiers and created single nucleotide polymorphism (SNP) profiles for some 1,000 human cell lines.11 The National Center for Biotechnology Information (NCBI) will maintain the catalog as it develops. “There are a massive number of cell lines in academia that have never been deposited in cell banks,” Neve says. “The idea of this paper is to lay down that foundation of good metrics and quality controls.” Capes-Davis agrees: “It’s a great step forward.”
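As a sketch of how an STR comparison plays out in practice, the commonly cited Tanabe metric counts alleles shared at the loci two profiles have in common, and scores of roughly 80 percent or higher are generally treated as a match. The profiles below are invented for illustration, and the exact threshold convention is an assumption rather than something this article specifies.

```python
# Sketch of STR profile matching using the Tanabe percent-match metric:
# 2 * shared alleles / (alleles in A + alleles in B), over loci typed in both.
# Profiles here are illustrative, not real cell-line data.

def tanabe_match(a, b):
    """Percent match between two STR profiles ({locus: set of alleles})."""
    loci = set(a) & set(b)                        # loci typed in both profiles
    shared = sum(len(a[l] & b[l]) for l in loci)  # alleles seen in both
    total = sum(len(a[l]) + len(b[l]) for l in loci)
    return 200.0 * shared / total

query = {"TH01": {6, 9.3}, "D5S818": {11, 12}, "TPOX": {8, 11}}
reference = {"TH01": {6, 9.3}, "D5S818": {11, 12}, "TPOX": {8}}

score = tanabe_match(query, reference)
print(f"{score:.1f}% match")  # ~90.9%; above a typical ~80% call threshold
```

The single dropped allele at TPOX barely dents the score, which is why STR matching tolerates the allele drift that accumulates over long passage while still catching wholesale identity swaps like RGC-5.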

Some journals are also strengthening their resolve to combat cell-line contaminants at the point of manuscript submission, requiring authors to document the source and validation of their line. At Nature and its sister journals, such an initiative has been in place since 2013, but most authors have not complied. In a recent editorial published in the same issue as Neve’s publication, the editors note: “Out of a sample of around 60 cell-line-based papers published across several Nature journals in the past two years, almost one-quarter did not report the source. Only 10 percent of authors said that they had authenticated the cell line. This is especially problematic given that almost one-third said that they had obtained the cell lines as a gift from another laboratory.” The journals’ editors now plan to be more proactive in asking for authentication data.

At least 30 other journals, including PLOS ONE, the Journal of Molecular Biology, and the Journal of the National Cancer Institute, have instituted similar policies requiring cell-line verification, according to a list compiled by Capes-Davis, but there’s a long way to go, says Neve. “It’s really a drop in the ocean, and it’s really quite worrying.”

Funding organizations and academic institutions have been slower to put such regulations in place, but there are signs of change. The National Institutes of Health (NIH) is planning to add a question to grant applications about how applicants plan to validate materials. “The first reason to do this is really to make both the applicants and the reviewers confront these problems,” says Jon Lorsch, director of the National Institute of General Medical Sciences. Lorsch hopes the question will be enough to get people to comply, but if necessary, validation could be made a condition of award. Still, there’s no formal method for enforcing such a policy. “There is that balance between administrative burden and efficiency,” Lorsch says.

That answer isn’t good enough for some concerned about the growing irreproducibility problem in the life sciences. “It lacks a stick,” Howard Soule, the executive vice president and chief science officer of the Prostate Cancer Foundation, says of the NIH’s position. “We need enforceable standards.” As an example, the Prostate Cancer Foundation last year implemented a policy that requires grantees to provide authentication data to receive a second funding check. Soule says that so far, the plan has worked. Not only have funding recipients complied, but the policy has helped root out contaminants, preventing scientists from wasting resources on a line they weren’t intending to study. “If NIH would do what we do, it would probably [save] hundreds of millions of dollars, maybe billions,” says Soule. “There are a lot of cells being called every which thing.”

For their part, laboratory-supply companies are responding to contamination with new assays to detect contaminants and with ultrapure, DNA-free reagents. Mike Brewer, director of pharmaceutical analytics at Thermo Fisher Scientific, says that his firm is developing a quantitative PCR product that could detect a wide variety of viruses in a sample.

And in terms of identifying mislabeled sequences in genome databases, database managers and users are leading the charge, developing automated sequence-verification methods to efficiently sort through the massive number of genome sequences they host. A German group led by Alexis Stamatakis, a bioinformatician at the Heidelberg Institute for Theoretical Studies, for instance, is developing an automated method to scan a large number of genetic sequences in a given database, identify sequences that seem unlikely to belong to the labeled organisms based on taxonomic information, and flag those possible rogue sequences for manual validation. Meanwhile, Kyrpides’s team at JGI has developed a program called ProDeGe (for Protocol for fully automated Decontamination of Genomes) that automatically removes any sequence foreign to the target organism, with 84 percent accuracy.12
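The flagging logic at the heart of both efforts can be reduced to a toy sketch: compare each contig’s predicted taxon against the genome’s declared source organism and set aside the disagreements for review or removal. The contig names and taxon calls below are invented; real predictions would come from a taxonomic classifier.

```python
# Toy sketch of rogue-sequence flagging: contigs whose predicted taxonomy
# disagrees with the declared source organism get queued for manual review.
# The taxon calls here are invented for illustration.

def flag_rogues(declared_taxon, contig_taxa):
    """Return contig IDs whose predicted taxon conflicts with the declared one."""
    return sorted(c for c, taxon in contig_taxa.items()
                  if taxon != declared_taxon)

predictions = {
    "contig_1": "Escherichia coli",
    "contig_2": "Escherichia coli",
    "contig_3": "Bradyrhizobium sp.",  # the reagent contaminant pattern from ref. 6
}
print(flag_rogues("Escherichia coli", predictions))  # ['contig_3']
```

The difference between the two approaches described above is what happens next: the Stamatakis method routes flagged sequences to a human curator, while ProDeGe removes them automatically.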

“I have a vision here that over the next few years we have a variety of computational approaches . . . to create curated subsets [of possible contamination] across all of GenBank,” David Lipman, the director of NCBI, told The Scientist in January. (See “Mistaken Identities,” The Scientist, January 1, 2015.)

To address the problem of low-quality draft sequences and the labeling errors they can cause, Oak Ridge’s Ussery has teamed up with Kyrpides as part of a working group called the Genomic Standards Consortium. One of their projects, in collaboration with the NCBI and the European and Japanese genetic databases, is to develop standardized quality scores for sequences, which could stop mislabeled or poorly assembled sequences from muddying subsequent studies. “Just put quality scores in there, and the user can decide, ‘I want everything above this threshold,’” says Ussery.
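Ussery’s threshold idea can be sketched in a few lines. The composite score below follows one common convention for draft genomes (completeness minus five times contamination, as in the community MIMAG criteria); the assemblies, their estimates, and the cutoff are all invented for illustration.

```python
# Sketch of threshold filtering on a draft-genome quality score, following
# one common convention: completeness - 5 * contamination (both in percent).
# The assemblies and their estimates are invented.

def quality_score(completeness, contamination):
    """Composite quality score in percent, penalizing contamination heavily."""
    return completeness - 5.0 * contamination

drafts = [
    ("assembly_A", 98.0, 0.5),   # near-complete, clean
    ("assembly_B", 70.0, 12.0),  # heavily contaminated draft
]

MIN_SCORE = 50.0  # the user-chosen "everything above this threshold" cutoff
keep = [name for name, comp, cont in drafts
        if quality_score(comp, cont) >= MIN_SCORE]
print(keep)  # ['assembly_A']
```

The point of standardizing such a score across NCBI and the European and Japanese databases is exactly this: downstream users could apply one cutoff everywhere instead of vetting each submission by hand.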

Lacking consistency

© ISTOCK.COM/LOLON

Much of the contamination problem may stem from the “wild west” nature of basic-science laboratories, says the GBSI’s Freedman. “There are no real, formal—or even informal—best practices in place,” he says. “Not just [for] how to handle cells, but from statistical analysis to the best way to validate an antibody. All of these things have been generally ignored.”

GBSI is one of a number of groups working to establish and disseminate best practices. Freedman says his team’s focus is on authenticating cell lines because it’s a tractable problem with straightforward solutions. The researchers launched a big #authenticate campaign via social media this year, for example, to raise awareness about the issues and remind investigators to check their cells. GBSI is also working to establish best practices for other protocols—from designing experiments to ensuring the quality of reagents—by developing free, online educational modules for researchers. “I think part of the problem is there’s no formal training when it comes to best practices overall,” Freedman says. GBSI has also applied for grant funding from the NIH to design online training courses for grad students and postdocs.

Sharing experiences with protocols and reagents could also help root out contaminants and establish best practices. Although not specifically designed to address contamination, is a new website that allows users to post their protocols or their tweaks on published techniques. The Protocol Exchange, hosted by Nature Protocols, has a similar mission. Users of the exchange can share methods and comment on one another’s techniques in an open-access format. Lenny Teytelman, the cofounder of, says the site can’t prevent problems, but it can help expose them. Often if people uncover issues with a technique or a reagent, he says, “there’s no place to communicate them. They end up in our heads, not on paper.”

Those aiming to clean up science’s contamination problem are hopeful that editorial policies, social media campaigns, and media coverage of irreproducible research will inspire labs to scrutinize their work for contamination. But even after a contaminant has been exposed, labs don’t always clean up the mess they’ve left behind in the literature. Take RGC-5, for instance. A PubMed search reveals just one retraction related to RGC-5—the paper describing the original characterization—and a PubMed search for RGC-5 and “correction” or “erratum” yields no results. Hundreds of studies that used RGC-5 continue to stand as written in the literature, including dozens that have been published since 2013, after the three leading vision journals banned RGC-5–based studies from their pages.

“If we’re not using what we think we’re using, we’re not testing our hypotheses. We’re just gumming up the literature,” says Molecular Vision’s Boatright. “I’m not sure what we’re doing, but that’s not science.”


  1. N.J. Van Bergen et al., “Recharacterization of the RGC-5 retinal ganglion cell line,” IOVS, 50:4267-72, 2009.
  2. R.R. Krishnamoorthy et al., “A forensic path to RGC-5 cell line authentication: Lessons learned,” IOVS, 54:5712-19, 2013.
  3. L.P. Freedman et al., “The economics of reproducibility in preclinical research,” PLOS Biology, doi:10.1371/journal.pbio.1002165, 2015.
  4. P. Hughes, “The costs of using unauthenticated, over-passaged cell lines: how much more data do we need?” BioTechniques, 43:575, 577-78, 581-82, passim, 2007.
  5. W.B. Langdon, “Mycoplasma contamination in the 1000 Genomes Project,” BioData Mining, 7:3, 2014.
  6. M. Laurence et al., “Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes,” PLOS ONE, doi:10.1371/journal.pone.0097876, 2014.
  7. R.W. Lusk, “Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data,” PLOS ONE, doi:10.1371/journal.pone.0110808, 2014.
  8. S.N. Naccache et al., “The perils of pathogen discovery: Origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns,” J Virol, doi:10.1128/JVI.02323-13, 2013.
  9. S.J. Salter et al., “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses,” BMC Biology, 12:87, 2014.
  10. S. Mukherjee et al., “Large-scale contamination of microbial isolate genomes by Illumina PhiX control,” Standards in Genomic Sciences, 10:18, 2015.
  11. M. Yu et al., “A resource for cell line authentication, annotation and quality control,” Nature, 520:307-11, 2015.
  12. K. Tennessen et al., “ProDeGe: A computational protocol for fully automated decontamination of genomes,” The ISME Journal, doi:10.1038/ismej.2015.100, 2015.