A paper in PLOS Biology today (June 8) describes Wide-Open, an automated system that scans published papers for references to publicly available datasets and determines whether those data are indeed available. The system, which identified hundreds of datasets overdue for public release in one particular functional genomics data repository, has garnered resounding support from researchers, open-science advocates, and database curators alike.
“[The system] is remarkably simple, very straightforward, and . . . very impactful,” says biological data analyst and open-science proponent Titus Brown of the University of California, Davis, who was not involved in the study. “It is a really great example of a simple idea that’s easy to implement that nobody else thought of.”
Advances in biological techniques and computational technologies mean it has never been easier for scientists to accumulate, store, and, in the interests of collective knowledge, share their data. Indeed, for many biologists, a normal course of events is to generate data, submit them to a centralized repository, and then make them available to the public upon publication of the associated study.
But, as Maxim Grechkin and Bill Howe of the University of Washington in Seattle discovered, sometimes the data-release step doesn’t happen. These computational scientists had been attempting to study the way researchers share and reuse data, but “during the process,” says Grechkin, who is a PhD student in Howe’s lab, “we found out that some people claim that they’ve released their datasets, but [they haven’t].”
Grechkin and Howe, together with Microsoft researcher Hoifung Poon, created computer code that allowed them to download all publicly accessible scientific papers from PubMed Central, scan the text for public database identifiers, or accession numbers, and run those identifiers against the public repositories to determine whether or not the data were actually public.
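The scan-and-check loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes series identifiers of the form "GSE" followed by digits, as used by the Gene Expression Omnibus, and it takes the repository lookup as a hypothetical `is_public` function supplied by the caller, since the article does not say how that query is made. The paper IDs and accession numbers below are made up for illustration.

```python
import re
from typing import Callable, Dict, List, Set

# Assumption: GEO series accessions look like "GSE" followed by digits.
GEO_SERIES = re.compile(r"\bGSE\d+\b")


def find_accessions(text: str) -> Set[str]:
    """Return the distinct accession numbers mentioned in a paper's text."""
    return set(GEO_SERIES.findall(text))


def flag_overdue(
    papers: Dict[str, str],
    is_public: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Map each paper ID to the cited accessions that are not yet public.

    `is_public` is a hypothetical stand-in for a real repository query.
    """
    overdue: Dict[str, List[str]] = {}
    for paper_id, text in papers.items():
        missing = sorted(
            acc for acc in find_accessions(text) if not is_public(acc)
        )
        if missing:
            overdue[paper_id] = missing
    return overdue


# Illustrative data only: these papers and accessions are invented.
papers = {
    "paper-A": "Data are available under GSE12345 and GSE67890.",
    "paper-B": "Expression profiles were deposited as GSE11111.",
}
released = {"GSE12345", "GSE11111"}

print(flag_overdue(papers, lambda acc: acc in released))
# prints {'paper-A': ['GSE67890']}
```

Passing the lookup in as a function keeps the text-scanning step separate from the repository check, so other repositories or identifier patterns could be swapped in without changing the loop.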
Using this system, named Wide-Open, the researchers ran an initial scan for accession numbers of datasets held at the Gene Expression Omnibus (GEO), part of the National Center for Biotechnology Information (NCBI). They found that among approximately 25,000 papers containing approximately 29,000 GEO accession numbers, 473 datasets were potentially overdue. So in February 2017, they alerted GEO.
Of the 473 datasets, 429 were indeed overdue and immediately released to the public, says GEO’s lead curator, Tanya Barrett, while a further 27 had already been released by the time GEO received the alert. “We release data every day,” Barrett explains.
“We knew there were some overdue datasets, so it wasn’t a huge surprise,” says Barrett, “but what the Wide-Open project clearly showed was that the backlog was growing. That was the really useful part for us.”
Another 14 of the datasets had also been released, but the accession numbers cited in the papers, and thus picked up by Wide-Open, contained typos. The remaining three datasets could not be released because of either incomplete submission or privacy issues, the authors explained.
The team also used Wide-Open to search published papers for accession numbers within the NCBI’s Sequence Read Archive (SRA), and found 84 potentially overdue datasets.
Prior to the automated Wide-Open system, GEO had been relying both on database users to alert them to overdue datasets—for example, complaints from people unable to access a published accession number—and on text searches of newly published papers for accession numbers. “But these are not automated, they are manual processes,” Barrett says. GEO and SRA now both plan to “add Wide-Open to the tools we use,” she says.
Wide-Open continues to be regularly updated with potentially overdue GEO and SRA datasets, and the team plans to add search capabilities for further public repositories in the future, says Grechkin.
“This is an excellent initiative and the early success with GEO is quite astonishing,” says Brian Nosek, co-founder and executive director of the Center for Open Science, which was not involved in the project.
But there is a major limitation with Wide-Open. The program depends on having publicly accessible papers to search, explains Grechkin. Journals that are behind a paywall are therefore unsearchable without either paying for a subscription or receiving permission from the publishers. “This is yet another argument for open-access [publications],” says Brown.
“Even though Wide-Open is currently very limited, it’s a very welcome first step towards automating checks for [data-sharing] compliance,” says former managing editor at the journal Molecular Ecology and at Axios Review Tim Vines, who also did not participate in the study.
So why, if scientists are willing to submit their data to public repositories, does non-compliance occur?
“Failures to follow through on data sharing intentions are often due to an innocent, human reason: forgetting,” says Nosek. “Researchers are busy and . . . a commitment months ago to share data [can easily] be missed or forgotten.”
Indeed, “we try to keep track, but sometimes it’s six months or even two years that a paper is bouncing around from journal to journal,” says genetic medicine researcher Ronald Crystal of Weill Cornell Medicine in New York, who at the time of the interview had one of the 200 or so overdue datasets listed on the Wide-Open website. “Some things fall through the cracks.”
This sentiment was echoed by computational toxicologist Patrick McMullen of ScitoVation and translation researcher Jamie Cate of the University of California, Berkeley, both of whom also had overdue datasets. Crystal, Cate, McMullen, and other researchers who responded to The Scientist were all supportive of Wide-Open, and now plan to release their data.
“I think [Wide-Open] is a great system,” says McMullen. It’s helpful to the authors, to the repository curators, as well as to the people that want to use the data, he says. In short, “it’s a win-win, or even a win-win-win.”
M. Grechkin et al., “Wide-Open: Accelerating public data release by automating detection of overdue datasets,” PLOS Biology, 15: e2002477, 2017.