A large-scale project designed to assess reproducibility in preclinical cancer research has identified significant challenges associated with repeating other scientists’ work, renewing calls for increased transparency and data-sharing in the biomedical community.

The project, launched in 2013 by the nonprofit Center for Open Science (COS) in collaboration with the online research marketplace Science Exchange, attempted to reproduce key results from more than 50 high-impact studies published between 2010 and 2012. 

Over the next eight years, the researchers managed to repeat experiments from a little under half of those studies, and found that the results they obtained were typically far less clear-cut than the ones reported in the original papers—an assessment that has drawn criticism from some of those papers’ authors. 

For the remainder of the studies, the team often wasn’t able to obtain enough information about the methods used from either the papers or their authors, and had to abandon attempts at replication altogether.

The project’s conclusions are summarized in two articles published today (December 7) in eLife.

Mike Lauer, deputy director for extramural research at the National Institutes of Health, congratulated the COS team on its “outstanding work.” Speaking as a commentator alongside the project’s organizers at a virtual press conference moderated by eLife last week, he emphasized that the findings don’t mean that science is “untrustworthy” or “hopeless,” but rather underline the need for a more open culture that values transparency and scientific process above individual publications. “A scientific paper rarely tells the whole story,” he said. “This is a rigorous way of showing that.” 

The project

Often blamed for challenges in translating animal research to humans, the problem of poor reproducibility has long been a source of concern in the biomedical sciences. The COS project, which was funded by the Laura and John Arnold Foundation, was one of several discipline-wide efforts launched in the 2010s to dig into the problem and provide possible solutions. 

The aim was to repeat 193 experiments from 53 high-impact cancer studies published in journals including Nature, Cell, and Science. Papers were selected primarily on the basis of citation counts, although studies describing clinical trials, case reports, or experiments that required access to particularly hard-to-obtain equipment or samples were excluded. 

Each study would be assigned to a group of researchers who would gather the necessary information to write a registered report, a paper describing exactly how they planned to repeat and analyze experiments. 

These reports would undergo peer review for publication in eLife, a step designed to boost transparency by spelling out plans in advance of any lab work, COS’s director of research Tim Errington told reporters at the press conference last week. Then, the teams would carry out the experiments and compare the findings to those of the original paper. 

Researchers have now completed all or most of this process for 23 of the original papers, leading to a total of 50 repeated experiments and 158 measured results. These replication studies underwent their own peer review to be published as papers in eLife.

While a handful of these studies successfully reproduced key parts of the original papers, many could reproduce only some parts, or none, or the results couldn’t be clearly interpreted. Using a meta-analysis to broadly compare the effects reported in replicated studies with those reported in original studies, Errington and colleagues found that the former tended to be smaller and were less likely to be significant than the latter. 

For example, of 97 numerical effects (such as a percentage increase in a particular metabolite’s concentration) that were statistically significant in the original studies, only 42 in the repeated studies were statistically significant and in the same direction. Seven were statistically significant in the opposite direction—an increase in the original study, for example, became a decrease in the replication attempt—and the rest were statistically insignificant, or null, results.

Differences in interpretation

For researchers who have been following reproducibility issues in scientific research, the results are unlikely to be surprising, says Marcus Munafò, a biological psychologist at the University of Bristol and chair of the steering group for the UK Reproducibility Network, a consortium aiming to improve the quality of scientific research. “It really just confirms the findings of other similar studies in slightly different domains,” Munafò, who was not involved in the COS project, tells The Scientist.

More than providing an assessment of individual studies, these kinds of large-scale projects act as a “diagnostic test” of how science is working, he says, adding that while the studies were published a decade ago, the project’s findings are still pertinent to our understanding of science. “It’s like a quality check—is this the replication rate we would expect to see? . . . And if it isn’t, where are we going wrong?” 

See “UK Group Tackles Reproducibility in Research”

Brian Nosek, COS’s executive director and cofounder, offered a similar interpretation at the press conference, saying that the project doesn’t provide answers to why a particular result could or couldn’t be reproduced. He said possible explanations include that the original result was a false positive; that the replication study produced a false negative; or simply that there was some small difference in conditions that led to a different outcome, so neither result was really incorrect.

This last possibility was highlighted by several scientists whose work was selected as part of the reproducibility project. Robert Holt of the BC Cancer Research Institute in Vancouver was senior author on a 2011 paper in Genome Research that reported an elevated prevalence of Fusobacterium nucleatum in colorectal tumors compared to normal tissue. The replication study associated with that paper, which Holt was not involved in, concluded that “the difference in F. nucleatum expression between [colorectal carcinoma] and adjacent normal tissues was . . . smaller than the original study, and not detected in most samples.” Holt writes in an email that he disagrees with some of the analyses and conclusions in the COS team’s paper. 

“I generally support these efforts, they are important contributions,” Holt notes, “but I also think that ‘generalizability studies’ would be a much better way to frame them because the question asked is usually how results generalize across different settings, experimental conditions and analytical approaches.”

Miguel Del Pozo of the National Center for Cardiovascular Research in Madrid agrees that reproducibility projects are important, adding that he took it as a compliment when the COS reached out about repeating part of his lab’s work. He coordinated with various colleagues and collaborators on the original paper to help the COS researchers obtain the information and materials they needed, he tells The Scientist.

But he notes several differences between his lab’s 2011 Cell study tracking extracellular matrix remodeling in the tumor microenvironment and the COS’s replication attempt. For example, in both studies, tumor-bearing mice were sacrificed for ethical reasons once their tumor burden got too large—but this endpoint was met after just 45 days in the replication attempt, compared to 70 days in the original study, a difference that the COS researchers acknowledge in their replication report prevents a clear comparison of the two studies. 

Del Pozo adds that, in general, differences between original studies and replication attempts likely often come down to expertise—some biomedical techniques can take months or years to hone, he says, so it’s not surprising they would be difficult to recreate in another lab. 

See “Potential Causes of Irreproducibility Revealed”

Dean Tang, a cancer biologist at Roswell Park Comprehensive Cancer Center whose 2011 Nature Medicine study on cancer stem cells was also selected, is more critical of the project. He tells The Scientist in an email that he was “extremely disheartened” by the COS team’s approach, and contests the validity of the group’s replication attempt of his work. Among other things, the COS study reported that levels of the microRNA miR-34a were elevated in certain tumor cells, rather than reduced as reported in the original paper.

In a 2019 letter to eLife, Tang and a coauthor write that COS’s researchers “deviated significantly from our original work,” listing several examples of differences in methods and materials. The pair add that although they “applaud conceptually” the project’s aims, “we find that this Reproducibility Project study is highly flawed, and contrary to the original [intentions] does very little to help the scientific community or greater public good.”

See “Research Teams Reach Different Results From Same Brain-Scan Data”

Errington acknowledged that some replication studies had to modify the original protocols due to financial, practical, or time constraints, but said that researchers on the project stuck as closely as possible to the original experiments, and always documented where they had deviated.

Barriers to reproducibility

Alongside the scientific work, COS researchers collected data on the practical obstacles they faced during the project. One of the main reasons replication attempts had to be abandoned was a lack of information: of the original 53 papers earmarked by the project, none could be recreated just from the methods described in those papers. 

This meant that researchers had to reach out to authors to request information about materials, experimental protocols, analyses, and other details that had been omitted in the published studies for space or other reasons. Responses to these requests were mixed: some authors were very helpful and responded to multiple follow-up questions and/or shared materials, the team’s data show. Addressing reporters last week, Marcia McNutt, the president of the National Academy of Sciences, referred to these authors as the “true heroes of this project.” 

But for 32 percent of the experiments the team wanted to replicate, authors didn’t respond or were “not at all helpful,” Errington said; some replied by questioning the value of the project. He blamed these kinds of communication problems not only for causing some replication attempts to be dropped, but for drawing out the time it took to complete the rest—an average of 197 weeks from selecting a study for replication to publishing the final report.

Discussing possible solutions, McNutt said that better incentives are needed to help boost sharing among scientists. Currently, “there’s little incentive to cooperate with a replication,” she said, particularly when it comes to papers that are already highly cited. Some practices—such as data sharing and code sharing—are already beginning to be mandated by journals or funders, she added. Lauer noted that the NIH will be implementing a new and improved data-sharing policy in 2023.

However, the speakers also noted that some of the bigger problems are cultural—namely, the scientific community’s tendency to view replications as personal attacks rather than as opportunities to advance understanding, and to treat publications as completed products rather than works in progress. “The problem is that we focus on outputs rather than process,” says Munafò, echoing a point made by several speakers at the conference. “That’s what I think fundamentally needs to change.”

Clarification (December 10): This article has been amended to include more detail about the replication study of Robert Holt’s work on colorectal cancer.