In a test of scientific reproducibility, multiple teams of neuroimaging experts from across the globe were asked to independently analyze and interpret the same functional magnetic resonance imaging dataset. The results of the test, published in Nature today (May 20), show that each team performed the analysis in a subtly different manner and that their conclusions varied as a result. While highlighting the cause of the irreproducibility—human methodological decisions—the paper also reveals ways to safeguard future studies against it.
“This is a landmark study that demonstrates clearly what many scientists suspected: the conclusions reached in neuroimaging analyses are highly susceptible to the choices that investigators make on how to analyze the data,” writes John Ioannidis, an epidemiologist at Stanford University, in an email to The Scientist. Ioannidis, a prominent advocate for improving scientific rigor and reproducibility, was not involved in the study (his own work has recently been accused of poor methodology in a study on the seroprevalence of SARS-CoV-2 antibodies in Santa Clara County, California).
Problems with reproducibility plague all areas of science, and have been particularly highlighted in the fields of psychology and cancer through projects run in part by the Center for Open Science. Now, neuroimaging has come under the spotlight thanks to a collaborative project by neuroimaging experts around the world called the Neuroimaging Analysis Replication and Prediction Study (NARPS).
Neuroimaging, specifically functional magnetic resonance imaging (fMRI), which produces pictures of blood flow patterns in the brain that are thought to relate to neuronal activity, has been criticized in the past for problems such as poor study design, weak statistical methods, and specifying hypotheses after results are known (SHARKing), says neurologist Alain Dagher of McGill University, who was not involved in the study. A particularly memorable criticism of the technique was a paper demonstrating that, without the necessary statistical corrections, it could identify apparent brain activity in a dead fish.
Perhaps because of such criticisms, nowadays fMRI “is a field that is known to have a lot of cautiousness about statistics and . . . about the sample sizes,” says neuroscientist Tom Schonberg of Tel Aviv University, an author of the paper and co-coordinator of NARPS. Also, unlike in many areas of biology, he adds, the image analysis is computational, not manual, so fewer biases might be expected to creep in.
Schonberg was therefore a little surprised to see the NARPS results, admitting, “it wasn’t easy seeing this variability, but it was what it was.”
The study, led by Schonberg together with psychologist Russell Poldrack of Stanford University and neuroimaging statistician Thomas Nichols of the University of Oxford, recruited independent teams of researchers around the globe to analyze and interpret the same raw neuroimaging data—brain scans of 108 healthy adults taken while the subjects were at rest and while they performed a simple decision-making task about whether to gamble a sum of money.
The researchers recruited the teams via social media and announcements at conferences, says Schonberg, adding that the response was amazing. “When we got 70 teams, we thought, ‘wow, this is a strong community that wants to know what’s going on and how can we improve.’”
The independent researchers had access not only to the raw image data but also the full details of the experimental design and protocols. They were asked to test nine specific hypotheses—each concerning whether gains or losses of activity in a particular brain region correlated with a certain decision.
Each of the 70 research teams taking part used one of three different image analysis software packages. But variations in the final results didn’t depend on these software choices, says Nichols. Instead, they came down to numerous steps in the analysis that each require a human’s decision, such as how to correct for motion of the subjects’ heads, how signal-to-noise ratios are enhanced, how much image smoothing to apply—that is, how strictly the anatomical regions of the brain are defined—and which statistical approaches and thresholds to use.
“There are just too many decisions that need to be made on how to analyze these data and not surprisingly all these 70 teams did something different and often reached very different conclusions,” writes Ioannidis.
The study “is really important,” says Roeland Hancock, a neurolinguistics researcher at the University of Connecticut who headed one of the 70 teams analyzing the data. “It speaks to the issues of reproducibility and where that variability is coming from: the unintentional degrees of freedom we have in our analysis.”
Some results were largely consistent. For example, 84 percent of the teams agreed that the data supporting hypothesis 5—a prediction that tied loss of activity in the ventromedial prefrontal cortex to loss of money—were significant. And more than 90 percent of the teams found that the data for three other hypotheses were not significant. But for the remaining five hypotheses, the teams’ conclusions varied.
“The lessons from this study are clear,” writes Brian Nosek, a psychologist at the University of Virginia and executive director of the Center for Open Science. To minimize irreproducibility, he says, “the details of analysis decisions and the underlying data must be transparently available to assess the credibility of research claims.” Researchers should also preregister their research plans and hypotheses, he adds, which could prevent SHARKing. And they should analyze their data with multiple methods, such as using different software and settings. Such a “multiverse approach” would help distinguish robustly significant results from those where significance depended on how the analysis was performed.
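The multiverse idea can be illustrated with a toy sketch. This is not the NARPS analysis and not an fMRI pipeline: the data are simulated, and the function names (`run_pipeline`, `multiverse`, `fraction_significant`), the two analytic choices (an outlier-exclusion cutoff and a t-statistic threshold), and the effect size are all assumptions made for illustration. The sketch runs the same one-sample test under every combination of choices and reports how often the result counts as significant.

```python
import itertools
import math
import random
import statistics


def one_sample_t(values):
    """t statistic for the null hypothesis that the mean is zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def run_pipeline(values, outlier_z, t_threshold):
    """One hypothetical pipeline: drop outliers, then threshold the t statistic.

    outlier_z is the exclusion cutoff in standard deviations (assumed >= 1).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    kept = [v for v in values if abs(v - mean) <= outlier_z * sd]
    return one_sample_t(kept) > t_threshold


def multiverse(values, outlier_options, threshold_options):
    """Run every combination of analytic choices on the same data."""
    return {
        (z, t): run_pipeline(values, z, t)
        for z, t in itertools.product(outlier_options, threshold_options)
    }


def fraction_significant(results):
    """Share of pipelines that declared the effect significant."""
    return sum(results.values()) / len(results)


if __name__ == "__main__":
    # Simulated per-subject effects for 108 subjects; the effect size of 0.2
    # is an arbitrary assumption, not a value from the study.
    rng = random.Random(0)
    data = [rng.gauss(0.2, 1.0) for _ in range(108)]
    results = multiverse(data, [2.0, 2.5, 3.0], [1.66, 1.98, 2.62])
    for choice, significant in results.items():
        print(choice, significant)
    print("fraction of pipelines significant:", fraction_significant(results))
```

A fraction near 1 or 0 would suggest a conclusion that holds across analytic choices, while a value in between would flag a result whose significance depends on how the analysis was performed.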
The paper lays out these same recommendations, but “this take home is not only for neuroimaging,” says Schonberg. The need to take such steps to increase the reliability of results applies to “every field in science. . . . Every time you have humans and a complex pipeline—a set of decisions with bifurcations—this is what you’ll end up with.”
R. Botvinik-Nezer et al., “Variability in the analysis of a single neuroimaging dataset by many teams,” Nature, https://doi.org/10.1038/s41586-020-2314-9, 2020.