Earlier this summer, Vaughn Cooper, an evolutionary biologist at the University of Pittsburgh, was busy promoting to scientists and educators a new secondary school curriculum for teaching evolution. He and his colleagues had published the program in Evolution: Education and Outreach in April, and they were eager to spread the word before the start of the upcoming school year. So when Cooper received an email from a colleague who couldn’t access his manuscript’s supplementary files because of broken hyperlinks, he was frustrated by the news.
The supplementary documents contained important information, such as the experimental protocols for students that his team had tested. This was not the first time that he’d come across issues with these types of files. “I’ve had multiple instances from multiple publishers where the supplementary material has gone missing,” he says, adding that this has occurred with both his papers...
Cooper went to Twitter to vent his frustration. In response, other scientists noted that they, too, had experienced similar problems. “I am afraid this is not uncommon,” tweeted Peter Murray-Rust, a chemist at the University of Cambridge. “Many (not all) journals generally regard supplementary data as a pain in the neck.”
In addition to broken links, scientists point to other problems plaguing these files—such as their increasing length and the inaccessibility of the formats they are published in. As a result of these issues, both academics and publishers are increasingly turning to independent, online repositories as one potential solution.
Last but not least
Supplementary files are not the most glamorous part of a scientific paper. They’re tucked away at the end of manuscripts, typically as links to downloadable PDF files. These documents can easily be overlooked by readers, but for many researchers working in a paper’s field, the cargo they carry is precious. They often hold material, such as data, methods, and equations, that could be useful in future experiments and is vital for assessing the reproducibility of a study’s findings.
“Supplementary materials are a key part of the scientific record, and if they are not accessible then it becomes harder for others to replicate research,” Kat Holt, a computational biologist at Monash University in Australia and the London School of Hygiene and Tropical Medicine, writes in an email to The Scientist.
Supplementary materials can also contain information critical to understanding the paper itself. “Increasingly, we’re being told to relegate material to the supplement,” meaning key elements that usually belong in the paper, such as results, sometimes end up in these documents, Cooper says. “When you put results in the supplementary files, then it means the main claims of the paper are possibly being relegated to this material that’s barely read.”
According to Cooper, he and his coauthors decided to move some of the results in his Evolution: Education and Outreach paper to the supplement to prevent the manuscript from becoming too long. “We thought that this would be good for readers, but now we’re recognizing that maybe we should have worked harder to make it part of the main text,” he tells The Scientist.
The broken links on Cooper’s paper were “an unintended consequence of making some improvements to how all of our articles are structured to make them machine-readable,” says Grace Baynes, the vice president of research data and new product development at Springer Nature, the journal’s publisher. “We are working hard on both fixing the problem with this particular article and ensuring that it doesn't happen again.”
But it’s not just broken hyperlinks that frustrate scientists. As papers get more data-intensive and complex, supplementary files often become many times longer than the manuscript itself—in some extreme cases, ballooning to more than 100 pages. Because these files are typically published as PDFs, they can be a pain to navigate, so even if they are available, the information within them can get overlooked. “Most supplementary materials are just one big block and not very useful,” Cooper says.
Another issue is that these files are home to most of a study’s published data, and “you can’t extract data from PDFs except using complex software—and it’s a slow process that has errors,” Murray-Rust tells The Scientist. “This data is often deposited as a token of depositing data, rather than people actually wanting to reuse it.”
Repositories on the rise
After Cooper alerted the journal to the broken links, he decided to deposit the supplements online. “I took the bull by the horns and published the supplementary materials on our website, and pointed back to the publication,” he says. (As of today, the files are still inaccessible on the journal website.)
Depositing material that would end up in supplementary files in places other than the journal is becoming an increasingly common practice. Some academics opt to post this information on their own websites, but many others are turning to online repositories offered by universities, research institutions, and companies. There is a huge range of options for authors to choose from: in addition to popular generalist repositories, such as figshare, Zenodo, and Dryad, there are dozens of subject-specific databases, such as GenBank for genetic sequences, OpenNeuro for neuroimaging data, and the Crystallography Open Database for crystal structures.
“If journals don't want to curate these properly, then just let us use another service with a DOI and link to them - e.g. dataverse, figshare. Or host them with a separate DOI?” tweeted Alan Huett (@ahuett) on July 30, 2019.
These repositories offer advantages over journals, according to Holt. For one, repositories are typically better than journals at storing large amounts of openly accessible data and letting users interact with it. In addition, files in repositories are labelled with a digital object identifier (DOI), meaning researchers can easily link to them from a published article and make sure to get credit for their work.
“Speaking personally as an author, I much prefer [the system provided by these repositories], because it lets me control how I manage my supplementary data files, with no imposition from the journals as to what file formats I can use,” Holt tells The Scientist. In addition to making it easy to share and re-use datasets, the “flexibility of file formats does away with one of my pet peeves—the dreaded PDF-formatted spreadsheet!”
Mark Hahnel, the CEO and founder of figshare, says that he started the company during his doctoral studies out of frustration with the limitations of supplemental files. “We expected to play this role for people who were producing outputs of research that didn’t fit into the model of publishing PDFs,” he tells The Scientist. But increasingly, academics also are using figshare for other reasons, he adds, such as being able to freely reuse material associated with a published paper without worrying about infringing upon copyrights. (While research outputs such as figures in a traditional journal may be subject to a publisher’s copyright policies, those deposited to repositories like figshare are usually published with a Creative Commons license that allows others to use the material without restrictions.)
Publishers encourage repository use
Publishers are turning to repositories as well. For example, the Microbiology Society, a professional organization and publisher, recommends that authors who submit to its journals deposit their supplementary files into standalone repositories such as figshare and Microreact for data, and into protocols.io for methods. Although researchers have the option of publishing supplementary files along with their manuscripts, the society prefers the use of repositories, according to Holt, who is the current editor-in-chief of Microbial Genomics, a publication of the Microbiology Society.
Some publishers, such as F1000, have stopped accepting supplementary material and instead require authors to submit data into an approved repository and cite it in their manuscript. “Journals can change very quickly—systems change, editorial workflows change—and supplemental links are not at the forefront of the mind of a publisher,” says Michael Markie, the publishing director of life sciences at F1000. “We say you need to put [data] into a repository, that way we can guarantee that readers will always be able to find that information.”
Like the Microbiology Society, Springer Nature recommends that researchers deposit their data into suitable repositories. In addition to this endorsement, the publisher is also working with figshare to make depositing into repositories an integrated part of the journal submission process, according to Baynes. Figshare also provides services for F1000, the Microbiology Society, Wiley, PNAS, and PLOS. “We do that for almost every big publisher except Elsevier, who have their own plans,” Hahnel says.
According to a 2017 Springer Nature survey of more than 7,700 academics worldwide, approximately 63 percent of researchers shared data when they were publishing a manuscript, but the proportion who used supplementary files was slightly higher than the proportion who used repositories. Replacing supplementary files with repositories in the future is “in my personal opinion, what we should be striving for,” Baynes says.
Some researchers still see academic journals as a better option for publishing supplementary files than independent repositories. “The traditional journal system is very entrenched in science and it just needs to be modernized,” says Mark Gerstein, a bioinformatician at Yale University. One fix Gerstein and his colleagues have proposed is a structured supplement, which would parallel the main text and be easily navigable via links from the primary manuscript. “The whole point of the supplement is to make the paper easier to read,” Gerstein adds. “We want people to have sections in the paper that they read quickly and enjoy, and we want them to know where to do that deep dive when they want to.”
“Supplementary information files remain the easiest way for researchers to [share data] in a way they’re familiar with . . . and at the moment, there aren’t enough incentives for researchers to spend their time putting data into a repository,” Baynes tells The Scientist. “The whole [scientific] community needs to work together to make it worth researchers’ time to manage and share their data in ways that are more optimal than supplementary files.”
Where to share?
There are more than 1,000 repositories where scientists can deposit data and documents associated with their manuscripts. The majority of these are subject-specific—there are ones specialized for chemical and molecular structures (Crystallography Open Database, Protein Data Bank, Coherent X-ray Imaging Data Bank), neuroimaging data (OpenNeuro, NeuroVault), and mathematical models (BioModels, The Network Data Exchange), just to name a few. Several publishers recommend that authors submit their material to such subject-specific repositories whenever possible.
Subject-specific repositories provide a few advantages, according to Grace Baynes, the vice president of research data and new product development at Springer Nature: they’re designed with the specific research community that’s using those data in mind, and putting data into a repository that your peers use may optimize your chances of connecting with a future collaborator. But such repositories still don’t exist for many research areas, and that’s where general repositories can play a key role, she adds.
Here’s a brief guide to some of the most commonly used general-purpose repositories.
| Repository name | Type of files accepted | Size limits | Submission fee | DOI assignment available |
| --- | --- | --- | --- | --- |
| Dryad | Any format | None listed | $120 US per data package (all data associated with one publication) | Yes |
| figshare | Any format | 5 GB per file for free accounts, but files up to 5 TB in size possible | Free for individuals, paid accounts for institutions | Yes |
| Harvard Dataverse | Any format | 2 GB per file (multiple uploads possible) | Free up to 1 TB | Yes |
| Open Science Framework (Center for Open Science) | No restrictions listed | 5 GB per file (larger files can be stored as add-ons from other providers) | Free | Yes |
| Mendeley Data (Elsevier) | Any format | 10 GB per dataset | Free | Yes |
| Zenodo (CERN) | Any format | 50 GB per dataset (larger files allowed on a case-by-case basis) | Free | Yes |
Diana Kwon is a Berlin-based freelance journalist. Follow her on Twitter @DianaMKwon.