The use of underperforming computational tools is a major offender in science’s reproducibility crisis—and there’s growing momentum to avoid it.
January 9, 2018
In the paper, published last October, researchers from the GTEx consortium had analyzed RNA sequencing (RNA-seq) data from more than 40 tissue types in the human body. The findings themselves were exciting, says Pachter, a computational biologist at Caltech. But a single line, tucked away in the methods section, left him feeling exasperated. The line read: “RNA-seq reads were aligned to the human genome . . . using TopHat (v1.4).”
In response, Pachter took to Twitter. “Please stop using Tophat,” he wrote in early December. “There is no reason to use it anymore.”
TopHat version 1.4 was a 2012 update to an open-source program conceived by Pachter and his colleagues in 2008 that aligns reads from RNA-seq experiments to a reference genome. Not only is version 1.4 far from the most recent version of TopHat—there have been more than 15 releases since—but TopHat itself has been overtaken by newer software, including HISAT, HISAT2, and STAR, developed primarily by other researchers.
“The original TopHat program is very far out of date, not just in time, but in performance—it’s really been superseded,” Pachter tells The Scientist. “By now, in 2017, certainly a high-profile consortium with interesting data oughtn’t be using this tool.”
Kristin Ardlie, director of the GTEx Laboratory Data Analysis and Coordination Center at the Broad Institute, notes that the group does pay careful attention to its choice of tool, but that there are inevitable delays given the project’s scale.
“Getting consortium papers written and to a publication endpoint can take a long time,” she writes in an email to The Scientist. The data for the October publications were finalized in 2014, and made public in 2015. “The original analyses of that would have been performed months before that time,” she adds. (TopHat2, TopHat’s immediate successor, became available in 2012.) “We do consider [TopHat v1.4] out of date (or that there are better versions available), and we have indeed updated our tools many times since.” More recent GTEx projects use STAR.
But Pachter points out that GTEx isn’t the only group putting out papers citing obsolete versions of the software. Since its 2009 publication, the original TopHat paper, coauthored by Pachter, his graduate student Cole Trapnell, and Trapnell’s coadvisor, Steven Salzberg, has racked up more than 6,500 citations—of which more than 1,000 were logged in the last year.
And TopHat is just one of many out-of-date computational tools to have become embedded as bad scientific habits. Indeed, anecdotal evidence, as well as recent research into the issue, suggests that the use of obsolete software is widespread in the biological sciences community, and rarely even recognized as a problem.
“Quite often, we’ve encountered students or faculty who have been unconsciously using these outdated software tools,” says Jüri Reimand, a computational cancer biologist at the University of Toronto. Asked why they haven’t considered updating their workflows, “they usually answer because they were first familiarized with those tools and they didn’t really pay attention to whether they were updated frequently.”
There’s now growing momentum to counter this attitude, as it becomes increasingly obvious that the choice of computational software can have a substantial influence on the progress of science. Not only do users of older methods fail to take advantage of faster and more-accurate algorithms, improved data sets, and tweaks and fixes that avoid bugs in earlier versions, they also contribute to a reproducibility crisis due to differences in the results new and old methods produce.
From that perspective, “when users are using very old tools that we really know are not the right thing to use, it in a sense devalues the contributions of all of us developing new methodology,” says Pachter. “It sends the message that it doesn’t really matter what program you use, that they’re all similar—and that’s not really the case.”
The last few years have seen a handful of efforts to quantify the effect of using outdated computational tools on biological research. In 2016, Reimand and his colleagues explored 25 web-based pathway enrichment tools—programs that help researchers tap into online databases to make sense of experimental genetic data. The team wanted to know whether updates to these databases and the software being used to access them were making their way into the literature, and whether those changes had an effect on scientific results.
Their findings were damning. In a letter to the editor published in Nature Methods, the researchers wrote that “the use of outdated resources has strongly affected practical genomic analysis and recent literature: 67% of ∼3,900 publications we surveyed in 2015 referenced outdated software that captured only 26% of biological processes and pathways identified using current resources.”
The main culprit in that statistic was a popular gene annotation software called DAVID, which, in 2015, had not been revised since 2010 (although it has since been updated). Despite its failure to discover nearly three-quarters of the information revealed using more-recent alternatives, DAVID had made it into more than 2,500 publications—many of which must have used the tool when it was already substantially out of date and superseded by other available tools, Reimand notes. “It is not the effect of people just taking a long time to publish results.”
Even when a single tool is regularly updated, the research community may significantly lag behind, as highlighted by a 2017 study by University of Pennsylvania pharmacologist and computational biologist Casey Greene and his former graduate student, Brett Beaulieu-Jones, in Nature Biotechnology.
The duo focused on just one tool: BrainArray Custom CDF, an online resource developed in 2005 consisting of various files that aid gene-expression experiments by matching DNA probes to genes. Combing through the 100 most recent publications that employed the tool, now in its 22nd version, Greene and Beaulieu-Jones found that more than half did not state which version the authors had used, making these studies’ findings essentially unreproducible. The remaining papers, which were published between 2014 and 2016, cited nine different versions, ranging from 6 to 19.
When the researchers applied several recent BrainArray Custom CDF versions to a gene-expression data set—obtained from human cell lines engineered to lack particular T-cell proteins—they found multiple discrepancies in the results. For example, while versions 18 and 19 both identified a total of around 220 genes showing significantly altered expression compared to controls, 10 genes that were identified using version 18 were omitted by version 19, and a further 15 genes that were identified using version 19 were missed by version 18.
“It’s making a difference at the margins,” says Greene. “If one of those is your favorite gene, it might change your interpretation.”
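The kind of comparison described above, where two versions of a resource flag overlapping but not identical gene lists, comes down to a set difference. The sketch below is only an illustration, not the method used in the Nature Biotechnology study: the function name `compare_gene_calls` and the gene identifiers are hypothetical placeholders, and the lists are toy data rather than real results.

```python
def compare_gene_calls(calls_a, calls_b):
    """Split two lists of gene calls into shared and version-specific genes."""
    a, b = set(calls_a), set(calls_b)
    return {
        "shared": sorted(a & b),   # flagged by both versions
        "only_a": sorted(a - b),   # flagged by the first version only
        "only_b": sorted(b - a),   # flagged by the second version only
    }


# Toy placeholder data, NOT real output from any version of any tool.
older_calls = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
newer_calls = ["GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"]

diff = compare_gene_calls(older_calls, newer_calls)
for label, genes in diff.items():
    print(f"{label}: {len(genes)} gene(s) {genes}")
```

Even when the totals look similar, the `only_a` and `only_b` lists show exactly which genes depend on the version used, which is where a reader's "favorite gene" could appear or vanish.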
Studies such as Greene’s and Reimand’s are a reminder that “there’s a difference between software and experimental protocol,” Pachter says. “Changes in computer science are very rapid—the pace of change and nature of change is just very different than it is for experimental protocol.”
But getting that message to researchers is not so simple, he adds. While some responders to Pachter’s December tweet suggested simply removing old tools or old versions of a software online—in order to, at the very least, prevent new downloads of obsolete tools—there are good reasons to retain a record of the computational dinosaurs online. “There is an argument—and it’s an important one—that people may want to reproduce old results or have the ability to run the software as it was at the time,” Pachter says.
Reimand agrees that reproducibility is a key reason to keep good records of older tools. “There should be a version available of the same software that allows you to go back to, say, six months from now, and say, ‘This is how I got the results back then,’” he notes. Many sites now do this: the BrainArray website, for example, currently hosts all 22 of its versions for download—although at the time of Greene’s 2017 study, at least five versions were unavailable.
Some developers instead opt for warning notices on the websites where software is available to download. On TopHat’s homepage, a notice below the description panel reads: “Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality . . . in a more accurate and much more efficient way.” (Emphasis TopHat’s.)
Pachter suggests that old versions of software could also be modified by developers to include their own warnings, “so that when you download the tool, and you go and actually run it, then the program itself outputs a message and says, ‘You can use this, but there are newer and better tools.’”
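A runtime notice of the sort Pachter describes could be as small as a few lines at a program's entry point. The following is a minimal sketch in Python, assuming a hypothetical tool called `oldaligner` and a hypothetical successor called `newaligner`; it does not reflect how TopHat or any real aligner is implemented. It uses `FutureWarning` because Python shows that category to end users by default.

```python
import warnings

TOOL_NAME = "oldaligner"   # hypothetical name of the superseded tool
SUCCESSOR = "newaligner"   # hypothetical name of its replacement


def superseded_notice(tool=TOOL_NAME, successor=SUCCESSOR):
    """Build the message shown to anyone who runs the old tool."""
    return (
        f"{tool} is no longer actively developed. You can still use it, "
        f"but newer tools such as {successor} may be more accurate and efficient."
    )


def run_alignment(reads):
    """Stand-in for the tool's real entry point: warn first, then do the work."""
    warnings.warn(superseded_notice(), FutureWarning, stacklevel=2)
    return len(reads)  # placeholder result; a real tool would align the reads


if __name__ == "__main__":
    run_alignment(["ACGT", "TTGA"])
```

The old version keeps working, so earlier results remain reproducible, while every run reminds the user that an alternative exists.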
On the flip side, publishers of scientific literature may also help increase awareness around the role of computational tools by requiring greater transparency about software information. A number of heavyweight publishing companies such as Elsevier, Springer Nature, and AAAS have adopted publishing guidelines aimed at improving reproducibility, many of which take into account the software problem.
“Including all the information, dependencies, configuration variables, test data, and other items necessary to repeat an analysis is really just part of the larger reproducibility picture, which Elsevier strongly supports,” writes William Gunn, director of scholarly communications at Elsevier, in an email to The Scientist. For example, one set of guidelines known as STAR methods—introduced by Cell Press in 2016 and now being expanded across Elsevier journals—“requires a description of the software, which includes version information, and a link to get it, unless it’s provided as a supplementary file,” adds Gunn.
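For analyses written in Python, the version information such guidelines ask for can be captured automatically at run time. The sketch below is one possible approach, not a requirement of STAR Methods or any publisher: the function name `record_environment` and the package names passed to it are illustrative assumptions, and it relies only on the standard library's `importlib.metadata` and `platform` modules.

```python
import json
import platform
from importlib import metadata


def record_environment(packages):
    """Collect interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Illustrative package names; list whatever the analysis actually imports.
    print(json.dumps(record_environment(["numpy", "pandas"]), indent=2))
```

Saving this output alongside the results gives authors an exact record to copy into a methods section.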
While initiatives like these might raise awareness of the risks of using outdated software, there are also moves in the biological sciences community to make the whole issue of updating computational tools—as well as switching between tools and various versions—a whole lot easier.
One possible solution, Greene notes, is for researchers to adopt the practice of uploading their entire computing environment with their publications, so that analyses can be run with any and all versions of a tool as they become available. “As a version changes, you could run the analysis with both versions through that software and quickly look at the difference in the results,” says Greene, whose Nature Biotechnology paper outlined how such a system could work in detail.
This sort of a dynamic approach to software is widely used in computer science, but remains a relatively novel concept among biologists. Nevertheless, as Nature reported earlier this year, some researchers see the transition to an era in which “scientists will no longer have to worry about downloading and configuring software” as only years away.
Until then, Pachter has advice for other tool developers. “Do as I’ve done, on Twitter and elsewhere, in public talks and statements,” he says. “Make a point of taking the time to tell people, ‘I have this tool, it’s very popular. Please don’t use it anymore.’”
January 10, 2018
The article suggests that data reproducibility is hampered by failing to keep up-to-date with rapidly changing versions of bioinformatic software. But of course, even if researchers use software that was published yesterday, in a year or two that software, too, will be out-of-date and obsolete, and publications based on it will be mired in the historical amber of their computational time-slice. That is an inevitability of rapid technological change, and users can hardly be blamed for what appears to be monetarily-driven planned obsolescence of their analytical tools.
It is important to distinguish between analytical power and analytical error. If the old software does not find as many contigs, or whatever, as the new software, so be it. Science progresses. New microscopes allow us to see finer details than old microscopes. But if the older software makes actual mistakes, then that is a different issue.
If there is a real problem of reproducibility here, it seems to be that bioinformatic tools have been foisted upon researchers while still laden with errors, bugs and flaws. I doubt that inference of homology, which is what sequence alignment and other annotation processes are really all about, changes that much conceptually from year to year, and perhaps if people writing the code were more familiar with first principles of comparative biology, they would not make so many errors that require such frequent updating.
January 10, 2018
Unless the software provides a result that is in error, it is not 'bad' software. There is no harm in suggesting that one use the most up-to-date versions of an analytical package. But the user will weigh the potential benefits of using more recent packages (and any newer technique of any kind) against the effort required to learn the newer package and the desire to compare recent results against prior results using the same methods.
The only serious question is whether there is an inherent flaw in a program that introduces errors and alters conclusions. For the TopHat example, if the alignments with the genome are incorrect and introduce errors, then the software can be classified as "bad". If not, then the alignments themselves are not in question and the results are not in question. Sometimes a version upgrade provides no improvement in the accuracy of the result you are interested in and the data reported would be the same regardless of the version used. If the argument is that a newer version is more convenient to the user, the user will decide whether that is true based upon her/his priorities.
January 11, 2018
As a scientist from a different field with a slower pace of change, I read this article with some bewilderment.
First of all, the only reproducibility issue I can see is people not mentioning which software version they used. Using older versions may be sub-optimal but nevertheless perfectly reproducible.
Second, when I read a description like this:
"For example, while versions 18 and 19 both identified a total of around 220 genes showing significantly altered expression compared to controls, 10 genes that were identified using version 18 were omitted by version 19, and a further 15 genes that were identified using version 19 were missed by version 18."
my conclusion is that the method/software is immature and should not be used as the sole basis for any scientific conclusion. If versions 18 and 19 of the software produce significantly different results, I would expect that version 20 changes everything again and conclude that this is a tool under development rather than something one can rely on to do research.