Inside the Project Trying to Save Datasets from Extinction

Just after midnight on March 24, 1989, the supertanker Exxon Valdez slammed into Bligh Reef in Prince William Sound, Alaska. The resulting oil spill was an unprecedented disaster for the region, its fish and rich wildlife, and the people and industries who depended on them. In the aftermath, more than $150 million of civil suit settlement money was allocated to ecological research and monitoring efforts to help scientists understand and mitigate the long-term effects of the spill.

Three decades later, most of the data collected in the wake of the disaster have gone missing. A five-year project that began in 2012 to recover the original data turned up just 30 percent; the rest were never digitized, were never shared, or were kept in a format inaccessible to outside researchers. In purely financial terms, a new study estimates that more than $100 million was spent to collect data that, effectively, no longer exist.

“Truly wild” is how University of Arizona community ecologist and study coauthor Ellen Bledsoe describes the scale of the Valdez data loss. Tallying it up “was definitely eye-opening, just as a way of quantifying monetarily how much data is lost.” Bledsoe and colleagues at the Canadian Institute of Ecology and Evolution (CIEE) published their estimate earlier this year alongside guidelines for the recovery and archiving of important ecological data. As part of CIEE’s Living Data Project, their goal is to identify datasets in danger of loss and take steps to preserve them before they disappear into the ether. Data rescue is the official term, but Bledsoe says she likes to think of it as “data necromancy”—bringing data back from the dead.

The project tackles a common contradiction in science. Without data, there is nothing to analyze and no way to test any hypothesis. Yet once they have produced results and publications, data are sometimes treated as tools that have outlived their usefulness, rather than the valuable, and often irreplaceable, records that they are. “Data have been seen as not exciting. They’re not science, they’re not proper idea generation,” says CIEE board member Alison Specht. “They’re a means to an end, and the curation, the management, and sharing of data was a time-consuming and rather low-grade task, and not usually funded.”

Ecology professor George H. La Roi’s data—collected over 35 years of studying North American boreal forests and stored in notebooks, CD-ROMs, and slides—are now preserved by the Living Data Project.

THE LIVING DATA PROJECT

The Living Data Project, which got its funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), aims to address both the immediate problem of data loss and the underlying cultural causes. The project trains graduate students on data management, then matches them with data owners such as research organizations or retiring academics. Students help clean and process aging datasets, eventually sharing them in an accessible repository.

“There are no courses in most biology curricula that teach people how to manage their data,” says Dominique Roche, a postdoctoral fellow at the University of Neuchâtel in Switzerland and coauthor on the Living Data Project paper. “It seems like such an essential skill. I guess it’s assumed that people who do research know how to work with data, but that’s the biggest fallacy ever.”

The number of older projects in need of rescue can be daunting—sharing data was rare in ecology before journals and funding agencies started to require it in the early 2010s. So the team’s new paper gives guidelines for prioritizing certain projects over others, including studies that cover a long period of time, a large geographical area, or multiple species. These are likely to be the ones that are most useful to future researchers, says Bledsoe, although she acknowledges that there are exceptions. If a biologist studies lions, a small but detailed dataset of lion behavior could be more useful than a continental-scale, long-term ecological dataset that doesn’t include lions. “It really is one man’s trash is another man’s treasure.”

Another factor in setting priorities for rescue is the risk of permanent loss. Information stored only on paper or on outdated media like floppy disks is especially vulnerable. Sometimes data are stored in official university department space, but just as often, they can end up in researchers’ garages or handed down to their children. In their paper, Bledsoe and colleagues describe the example of University of Alberta forest ecology professor George H. La Roi. Upon La Roi’s death, his children bequeathed his collected notebooks, CD-ROMs, and slide images from 35 years of studying North American boreal forests to one of his colleagues. The Living Data Project was able to match the new owner with trained students to restore and preserve this irreplaceable ecological record.

[Illustration: Cartoon of people searching for data through books and a computer screen. Credit: Andrzej Krauze]

Technical advances are making data preservation easier and more reliable than ever before. Repositories are much more common than they were even a few years ago, and programs such as CoreTrustSeal, established in 2017 through an international collaboration of organizations focused on data archiving and transparency, now grant certification to repositories that are sustainably maintained and updated.

Still, technological developments don’t address a lack of incentives to maintain datasets in a usable state, says Mark Westoby, a professor emeritus at Macquarie University who is not involved in the Living Data Project. “Academic careers run on publication,” he says. “It’s by far the most important incentive for how academics—probably government scientists as well—decide how to spend their time.” Westoby recently coauthored a paper calling for a new career currency for data providers, apart from publications and journal impact scores—but such cultural changes take time, he says. While funding agencies and scientific journals are increasingly implementing data-sharing requirements, these can lead to a letter-of-the-law approach, he adds, where some data are shared to meet requirements, but not necessarily in complete datasets or in a particularly legible format.

Westoby is supportive of the Living Data Project’s efforts to rescue old datasets, but notes that the group’s paper sidesteps the costs of doing so as well as the motivation issue. “Having guidelines to revive, resurrect, rescue data that otherwise might be lost is all good advice. It didn’t really tackle the question of how many person-hours and person-years are we talking about, and is it worth it?”

Ultimately, everyone who spoke to The Scientist agreed that the ideal system is one where rescue isn’t necessary at all. “Data rescue is a great concept,” says Roche, “but ideally what we want to do is get rid of data rescue. It would be a lot less work for people if they thought of data management and sharing from the very onset of a project, so that data are not at risk of being lost.”