In the past several months, a movement has sprung up among librarians, environmental and computer scientists, and supporters of access to public data to create archives of environmental information. While the volume of material is daunting (we’re talking hundreds of millions of webpages), volunteers have made considerable headway, collecting terabytes (TB) of data to date, with more being collected all the time.
One team, called the Azimuth Climate Data Backup Project, had by February 11 backed up 19 TB of data from NASA, the National Oceanic and Atmospheric Administration (NOAA), and other federal agencies that collect climate-related data. “I know there is a gap, because there are several large datasets which are still being downloaded, mostly by me,” Jan Galkowski, a statistician who contributes to the group, told The Scientist in an email. “We have the capability and will probably fill 40 TB of storage with the data we have replicated. This will take some time to move to its eventual homes, simply because network transfer speeds are not that high.”
The election of President Donald Trump, and expectations that climate-related projects in particular would face cuts during his administration, have stoked fears that access to government environmental data may be in peril. “We’re not going to see any disappearing data overnight,” said Michelle Murphy, director of the Technoscience Research Unit at the University of Toronto. “One way to lose data is to close a program. . . . [Its dataset] doesn’t have to be deleted, it just becomes uncared for, goes offline, goes into a drawer.”
In response to those concerns, Murphy helped start the Environmental Data and Governance Initiative (EDGI) several months ago. The volunteer-run project coordinates the collection and archiving of US government data through so-called data rescue events, in which people meet up to allocate expertise and computing power to saving particular datasets or websites. Dozens of these events have been held around the U.S. and Canada so far, with more scheduled.
EDGI partners with the DataRefuge network (an initiative launched by the Penn Program in the Environmental Humanities at the University of Pennsylvania) and the Internet Archive to store the webpages and data that participants collect. According to Dawn Walker, who works with EDGI, 1.7 tebibytes (TiB) have been downloaded for inclusion in DataRefuge, including 158 datasets that are now available for download. Data rescue volunteers have also nominated 73,500 URLs from government websites to be crawled by the Internet Archive. (Crawling is the process of downloading a website, then fanning out to each link from that site and downloading those, and so on.)
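For readers curious about what that fanning-out looks like in practice, here is a minimal sketch of a breadth-first crawl. It is an illustration only, not the Internet Archive's actual crawler: the `crawl` function, the `get_links` callback, the `max_pages` limit, and the `example.gov` toy site are all hypothetical, and the example walks an in-memory link map rather than touching the network.

```python
from collections import deque
from typing import Callable, Iterable, List

def crawl(seeds: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> List[str]:
    """Visit each seed URL, then every page it links to, and so on."""
    queue = deque()
    seen = set()
    for seed in seeds:                 # start from the nominated URLs
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)            # a real crawler would download and store the page here
        for link in get_links(url):    # fan out to every link found on this page
            if link not in seen:       # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical toy site: each URL maps to the links found on that page.
TOY_SITE = {
    "https://example.gov/": ["https://example.gov/data", "https://example.gov/about"],
    "https://example.gov/data": ["https://example.gov/data/buoys.csv", "https://example.gov/"],
    "https://example.gov/about": [],
    "https://example.gov/data/buoys.csv": [],
}

def toy_links(url: str) -> List[str]:
    return TOY_SITE.get(url, [])

pages = crawl(["https://example.gov/"], toy_links)
```

The `seen` set keeps the crawl from looping when pages link back to one another, and `max_pages` is a simple stand-in for the scope limits a production crawler would need.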
Jefferson Bailey, the director of web archiving at the Internet Archive, said some of these would likely have been included in the organization’s already-scheduled End of Term crawl, which collects URLs from .gov and .mil at the turnover of each presidential term. But the efforts of EDGI are complementary. Four years ago, Bailey said, people nominated only around 1,400 URLs to crawl. “We can crawl, scale, and store a lot,” he told The Scientist. “But we don’t have the time and staff to host events.” Working with EDGI, he said, has “been a good pairing.”
In addition to the organized efforts of EDGI and Azimuth, concerned citizens have made their own individual contributions. Bryce Lynch, a Bay Area security specialist who previously worked at NASA, has been downloading sensor-buoy data from NOAA continuously for the past two months. He also participated in a local data rescue event.
“I’ve got maybe 29-30 terabytes here on my rack. . . . I’m devoting half of my bandwidth to downloading this stuff,” he said. “I’ve been going at it for two months now. I’m not stopping anytime soon.”
Nor is the guerilla archiving momentum. At Rice University in Houston last Saturday, around 75 volunteers spent the day searching for websites to be archived, or writing code to harvest data. Kathy Hart Weimer, head of the Kelley Center for Government Information, Data and Geospatial Services at Rice’s Fondren Library, said she was inspired to organize the event knowing what had happened in previous administrations when the Environmental Protection Agency’s budget was cut. “That caused some libraries to close,” she said. “Librarians who remember that are attuned to federal budgets. We want to make sure information is maintained so not only scientists can access the data, but the public as well.”
As volunteers were packing up to leave the data rescue event, Weimer said, they were asking: “Are we going to do it again?” Maybe they will, she told The Scientist. “We’ll see what happens next.”