A growing body of research has highlighted scientists’ inability to reproduce one another’s results, including a 2012 study that found only 11 percent of “landmark” cancer studies investigated could be independently confirmed. This dismally low reproducibility figure is estimated to be even lower for large-scale omics studies because outside reviewers are often stymied by a lack of detailed protocols and of access to the resources needed to perform the analyses.
“Some communities have standards requiring raw data to be deposited at or before publication, but the computer code is generally not made available, typically due to the time it takes to prepare it for release,” explained Victoria Stodden, an assistant professor of statistics at Columbia University.
The inability to validate is particularly troubling because omics studies are understood to be error-prone. Given the sheer size of most data sets, it is not uncommon for even highly unusual events to occur by chance. On top of that, the increasingly complex computational tools used for high-throughput analyses are often subject to biases and errors.
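The point about chance findings can be made concrete with a little arithmetic. The sketch below is illustrative only: the number of tests and the significance threshold are assumed values, not figures from any study mentioned here, and the tests are assumed to be independent.

```python
# Illustrative numbers only; neither value comes from the article.
n_tests = 20_000  # assumed number of independent tests, e.g. one per gene
alpha = 0.05      # assumed per-test significance threshold

# With no real effect anywhere, each test still passes with probability alpha.
expected_false_positives = n_tests * alpha

# Probability that at least one test passes purely by chance.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: a stricter per-test threshold that keeps the
# chance of any false positive across all tests at or below alpha.
bonferroni_threshold = alpha / n_tests

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one false positive): {prob_at_least_one:.4f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.2e}")
```

Under these assumptions, roughly a thousand "significant" hits would be expected even if nothing real were going on, which is why outside re-analysis of the code and data matters.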
While some journals have tried to make the research process more transparent—Nature and Science, for example, require authors to make their data available whenever possible, and the latter publication has extended this requirement to include code and software—uptake has been spotty. In a June PLOS ONE study, Stodden and her colleagues showed that only 38 percent and 22 percent of submitting authors adhere to journals’ data and code policies, respectively.
Meanwhile, though, some scientists are openly archiving their data and code on their own, either through personal or institutional websites, or on sites like the Reproducibility Project’s Open Science Framework and RunMyCode.org. Some, too, are using workflow platforms—such as GenePattern, MyExperiment, Galaxy and Taverna, to name a few—to help other investigators replicate their results.
Seattle-based Sage Bionetworks is taking a different approach—one that “makes reproducibility a byproduct of the research process itself, rather than simply a burden at time of publication,” said Stephen Friend, the organization’s co-founder, director, and president.
Sage’s solution? An open-source computational platform, called Synapse, which enables seamless collaboration among geographically dispersed scientific teams—providing them with the tools to share data, source code, and analysis methods on specific research projects or on any of the 10,000 datasets in the organization’s massive data corpus. Key to these collaborations are tools embedded in Synapse that allow for everything from data “freezing” and versioning controls to graphical provenance records—delineating who did what to which dataset, for example.
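To give a sense of what a provenance record captures, here is a minimal sketch in Python. It is not Synapse's API: the `ProvenanceRecord` class, the `record_step` helper, the user name, and the sample data are all hypothetical, chosen only to show how "who did what to which dataset" can be tied to exact, frozen versions of the data through content hashes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of one analysis step (not a Synapse type)."""
    actor: str            # who ran the step
    action: str           # what was done
    input_hashes: tuple   # fingerprints of the exact input versions
    output_hash: str      # fingerprint of the result
    timestamp: str        # when the step was recorded (UTC, ISO 8601)


def content_hash(data: bytes) -> str:
    """Fingerprint data by content; any change yields a different hash."""
    return hashlib.sha256(data).hexdigest()


def record_step(actor, action, inputs, output):
    """Build a provenance record for one step over in-memory byte strings."""
    return ProvenanceRecord(
        actor=actor,
        action=action,
        input_hashes=tuple(content_hash(d) for d in inputs),
        output_hash=content_hash(output),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


# Made-up data standing in for a raw dataset and a processed version of it.
raw = b"sample_id,expression\nA,1.2\nB,3.4\n"
normalized = raw.replace(b"1.2", b"0.26").replace(b"3.4", b"0.74")

step = record_step("alice", "normalize", [raw], normalized)
print(json.dumps(asdict(step), indent=2))
```

Because the hashes depend only on content, anyone holding the same input file can confirm they are starting from the same version before rerunning the step.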
By incorporating these tools throughout the research cycle, rather than as post-hoc descriptions added at the time of publication, Friend added, “data and analysis resources created by collaborations can be released to the general research community for verification.”
Synapse is being put to the test this year, serving as the framework for eight big data computational challenges hosted by Sage and DREAM (Dialogue for Reverse Engineering Assessments and Methods), a distributed systems biology group. The challenges, which require participants to openly share code and analyses through Synapse, rally diverse teams of researchers around a central goal, such as developing a predictive model of disease. Participants reuse and build upon one another’s work to generate winning models. Results from one such challenge, the Breast Cancer Prognosis Challenge, were published in Science Translational Medicine in April.
In addition to the narrative text summarizing the study, the paper links out to the full spectrum of the study’s details, as captured in Synapse. This not only allows others to read the article, but also provides “everything needed to simultaneously reproduce the very same analyses reported in the paper.”
Synapse is also powering The Cancer Genome Atlas’s (TCGA) Pan-Cancer project, a massive effort to chart the molecular landscape of the first 12 tumor types profiled by its participants. The effort includes 250 researchers, spread across 30 institutions, running 60 different research projects based on the integrative analysis of 1,930 input data files. Because many such projects are interdependent, the researchers used Synapse to manage multi-stage analyses and share their results.
“It was indeed the connecting data framework that held the entire project together,” said Josh Stuart, professor of biomolecular engineering at the University of California, Santa Cruz, who is part of the TCGA-led project.
Much like the Breast Cancer Prognosis Challenge publication, papers stemming from Pan-Cancer include embedded links to well-curated, analysis-ready data sets, code, and detailed provenance records, which Friend noted are “allowing others to rerun the analyses from scratch and verify results.”
“It provides a framework for the science to be extended upon, instead of publication as a finite endpoint for research,” he added. As the Pan-Cancer project expands, data sets generated by the project will continue to be maintained, and new information made immediately available to the community to allow anyone to contribute to the effort.
Friend said that the approach made possible through Synapse not only increased the resource value of the project’s work but also accelerated the pace of scientific progress. Moreover, because the entire research process is exposed, any stage of analysis may serve as a starting point for additional scientific projects, a level of transparency and reproducibility Friend said will “transform data into knowledge and knowledge into discovery.”