Save our Data!

Here's how to prevent critical biological data repositories from disappearing into the ether

The Plant Genome Database Working Group
Mar 31, 2006
Credit: Photos Courtesy of Marc L. Lieberman/Salk Institute

The public research sector has invested hundreds of millions of dollars in grants to generate large-scale biological data sets, most notably in the field of genomics. These large data sets include genome sequences, gene expression array results, extensive surveys of sequence variation within populations, and findings from protein-protein interaction studies. Such data sets are housed in many online repositories, ranging in size and scope from small single-organism databases to large multiorganism databases such as the comprehensive GenBank sequence database.

The problem is that most of these databases were established at the initiative of individual research groups, so their longevity is constrained by the continuing enthusiasm of their founders and by their prospects for long-term funding. In other cases, individual researchers who lacked an obvious repository for the data they were generating released those data through their own Web or FTP sites. When a database's founder moves on to other projects, loses interest, or loses funding, there is a clear and present risk that the database will slowly decay from lack of updating, eventually breaking down completely and disappearing from the Internet. When that happens, the data become inaccessible.

In January 2005, representatives of the National Science Foundation's Plant Genome Program and the US Department of Agriculture's Agricultural Research Service asked us to form a working group to examine these issues. Although our primary focus was on the needs of plant biology, our discussion and conclusions apply to the maintenance of other genome-scale data sets, including those of animals, fungi, protists, and prokaryotes.

Static Versus Curated Repositories

One of the core issues we wrestled with was the respective roles of static versus curated repositories. A static data repository - for example, the GenBank database, which contains chronologically ordered sequence submissions - is a relatively unchanging archive of information. The "business model" of a static repository is that of a self-service storage facility. The owner of the data checks his or her data set in, and only the owner has permission to modify it.

In contrast, a curated data repository can be likened to an art museum. Repository curators actively seek out new data sets to incorporate into the collection, and once a data set has been entered, they are free to reorganize and integrate it with other data sets, to find and annotate inconsistencies, and to add editorial comments. The National Center for Biotechnology Information has a well-known curated repository, Entrez Gene, which is a systematic collection of genes from multiple species that have been annotated by experts from each species' research community. Other curated repositories include model organism databases (MODs) focused on a single species, such as the WormBase database for Caenorhabditis elegans (www.wormbase.org), and TAIR, The Arabidopsis Information Resource (www.arabidopsis.org).

Researchers often prefer curated repositories, but supporting such a care-intensive facility comes at a cost: Such repositories are built and managed by biological curators, a specialized cadre of PhD-level biologists who combine their scientific expertise with information-management skills. A typical curated repository, with a staff of two curators and a half-time programmer, will cost in excess of $250,000 per year.

MODs Versus CODs

MODs are typically formed by research communities and often start out as an online directory for shared resources. For example, MaizeGDB began life as an electronic catalog of maize mutants and their genetically mapped locations. As the cost of genome-scale technologies has decreased, research communities have moved from analyzing single species to analyzing entire phylogenetic clades of related organisms. Making comparisons among multiple species is a powerful way both to identify functional elements in the genome and to understand how genes evolved in response to selective constraints.

The need to perform such comparisons has led to the creation of CODs (clade-oriented databases) that contain information on multiple related species and provide researchers with analysis and visualization tools for making comparisons within and among species. The contents of CODs can be manually curated or built automatically with computational pipelines, and most CODs combine elements of both manual and automated annotation. Examples of CODs include the Gramene database of cereal genomes (www.gramene.org) and the Genome Browser of vertebrate genomes (http://genome.ucsc.edu) at the University of California, Santa Cruz.

These fundamental issues and developments were the basis of the group's principal recommendations, highlights of which follow:

1. Encourage CODs. Because multispecies databases provide researchers with a level of information that is not available from traditional MODs, the working group recommended that the funding agencies promote repositories that look beyond a single model species, encouraging both the formation of repositories equipped to store and analyze data from multiple species simultaneously and the use of technologies that allow the information held in multiple databases to be compared and integrated.

2. Develop a funding mechanism that would support biological databases for longer cycle times than under current mechanisms. Presently, most curated databases are funded as research projects for cycles of three to five years under a process of competitive grant review. This is insufficient to establish a stable resource and to create an environment attractive to those biologists who wish to make a professional career of data curation. We encourage the funding agencies to develop a mechanism to fund static and curated repositories for renewable periods of seven to 10 years, subject to annual review by an advisory board and assessed by a set of objective measurements of performance.

3. Foster curation as a career path. The specialized cadre of PhD-level biologists who acquire, develop, and maintain integrated data sets is too small to meet current needs. We recommend that the funding agencies as well as educational institutions put renewed emphasis on data curation as a respected career path. This would involve developing the proper curricula, mentoring scientists, supporting specialty conferences, and launching peer-reviewed journals specializing in curation research and methodology.

4. Balance data generation and information management. Because the storage of data and/or reagents generated by high-throughput studies is so vital to the research community, funding agencies should insist that potential data providers include in their proposals a plan for long-term storage and maintenance of the projects' generated data sets and reagents. A minimum set of standards for the publication of data sets includes using publicly recognizable identifiers for biological data objects, using accepted nomenclature to describe the data set, using standard formats for data files, and linking the identifiers of reagents submitted to stock centers to the identifiers used in data files. Whenever possible, data providers should make arrangements to collaborate with existing repositories and stock centers rather than implementing entirely new information resources.
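As a concrete illustration of the minimum publication standards named above (public identifiers, accepted nomenclature, standard file formats, and links to stock-center reagents), here is a minimal sketch of how a project might emit one record in GFF3, a widely used tab-delimited format for genome features. The specific identifiers and the "StockCenter" database tag are hypothetical examples, not names drawn from this article.

```python
# Sketch of recommendation 4's minimum standards for a published data set:
# a public identifier, a standard file format (GFF3), and a cross-reference
# linking the record to a stock-center accession. The identifiers used here
# (AT1G01010, NAC001, CS12345) are illustrative placeholders only.

def gff3_feature(seqid, source, ftype, start, end, attributes):
    """Format one genome feature as a tab-delimited, nine-column GFF3 line."""
    attr = ";".join(f"{k}={v}" for k, v in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", "+", ".", attr])

line = gff3_feature(
    "Chr1", "example_project", "gene", 3631, 5899,
    {
        "ID": "AT1G01010",                # publicly recognizable identifier
        "Name": "NAC001",                 # accepted community nomenclature
        "Dbxref": "StockCenter:CS12345",  # link to a stock-center reagent
    },
)
print(line)
```

Because the record uses a standard format and stable identifiers, any downstream repository can parse it with off-the-shelf tools and follow the Dbxref back to the physical reagent, rather than deciphering a one-off file layout.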

5. Advance comparative biology. In order to avoid an unsustainable proliferation of species-specific databases, and to encourage the emerging discipline of comparative genomics, we recommend funding support for mergers and cooperative agreements among existing and proposed databases that allow for comparisons among species and for the integration of multiple types of data.

6. Separate the technical infrastructure from the human infrastructure. Many automated computational tasks do not require specialized species- or clade-specific knowledge. These tasks include, for example, gene prediction, EST assembly, genome alignment, and protein family identification. In order to avoid redundant and inconsistent efforts, funding agencies should encourage partnerships between groups that can provide technical infrastructure for automated annotation tasks and groups that are skilled at curation.

7. Standardize data formats and user interfaces. The lack of standardization among related data sets makes it impossible to integrate and analyze them with uniform procedures. Data providers should be encouraged to use standard file formats whenever available. Data repositories should provide standard user interfaces in addition to any custom ones they wish to develop. When suitable standards do not exist, support should be available to develop them.

With a coordinated funding and training plan for database maintenance and curation, the rapidly growing volume of high-throughput genomic and functional information will retain its value to the biological community for many years to come.

lstein@the-scientist.com

The members of the Working Group:

Lincoln D. Stein Cold Spring Harbor Laboratory

William D. Beavis National Center for Genome Resources

Damian D. Gessler National Center for Genome Resources

Eva Huala Carnegie Institution of Washington

Carolyn J. Lawrence USDA Agricultural Research Service

Doreen Main Clemson University

Lukas A. Mueller Cornell University

Seung Yon Rhee Carnegie Institution of Washington

Daniel S. Rokhsar Lawrence Berkeley National Laboratory

For a longer version of the Working Group's recommendations, see www.gramene.org/resources/plant_databases.pdf.