Feeding the Info Junkies

Jun 2, 2003
Jeffrey Perkel

In 1925, Drosophila pioneer Thomas Hunt Morgan and his students published their first compendium of fruit fly mutations. Known informally as the "Red Book," the catalog was continuously updated until 1992, by which time it had swelled to more than 1,100 pages. Drosophila needed to go digital.

FlyBase went live in 1992. Today, this model organism database (MOD), just one of a growing number of such resources, logs about 30,000 hits per day. The curators of these MODs are biologists and bioinformaticians, who strive to integrate newly acquired genome-scale datasets, such as sequence, expression, and proteomic data, with more traditional biological information such as genetic maps, mutants, and phenotypes. The MODs' core mission, says WormBase developer Lincoln Stein of Cold Spring Harbor Laboratory, "is to make sure that anything interesting that gets reported in the literature finds its way into the database so that people can find it later."

MODs also act as arbiters of genetic nomenclature, and as sources of reagents and stocks. Many contain directories of researchers and labs in the organism's community, and, notes a report from a recent workshop, serve "as a nexus for discussions, announcements of interest to the community, and data submissions."1 MODs "not only center around a model organism but also around that research community," concludes Kara Dolinski, head curator and bioinformatics programmer at the Saccharomyces Genome Database (SGD).

These publicly funded, freely accessible databases have accelerated science by delivering, with a simple mouse click, information that once required countless hours to track down and assimilate. Scientists can plan better-directed and more sophisticated experiments. "It certainly makes it easier to access more information," explains WormBase worker John Spieth, "and the more information you have, presumably, then you can better make hypotheses."

Frank Slack of Yale University says MODs have become indispensable tools. "If [WormBase] went away, we'd be like junkies without a fix." Yet the databases face constant informatics challenges, with scientists demanding access to new data types and cross-species information.

THE MINIMAL MOD Every MOD, says Stein, provides four basic features: the underlying database, its user interface, curational components, and annotation tools. From the user's perspective, the most important of these are the database and its external facade, the user interface. But those behind the scenes--the MOD squads--concentrate on annotation and curation. Annotation tools, says Stein, "help curators to figure out what the functions of genes are." This process requires a blend of computational wizardry and scientific know-how. Recently, a group of biologists and fledgling bioinformaticians annotated the entire Drosophila genome.2 It took 10 people almost a year to complete the task, says Suzanna Lewis, director of bioinformatics, Berkeley Drosophila Genome Center, where some of the work was done. The group developed computational tools and annotation rules as they went along, mixing published biological data with the results of alignment algorithms and gene-prediction software.

Curational tools detail procedures "by which scientific curators in the database find new articles that have to do with the organism, triage them to the appropriate expert, and then classify them," says Stein. At the current major MODs--WormBase, FlyBase, Mouse Genome Informatics (MGI), SGD, Zebrafish Information Network, and Arabidopsis Information Resource--curators read every paper dealing with their particular organism and log pertinent information.

This process links the organism's biology with its genome sequence, and distinguishes MODs from archival resources such as GenBank, and from automated annotation systems such as EnsEMBL. It also allows researchers to ask remarkably complex questions. An MGI user, for instance, can request all genes on chromosome 2 that are expressed at Theiler stage nine, and which have been annotated as transcription factors. "We don't just store [the] consensus representation," adds Carol Bult, a principal investigator in the MGI consortium. "You can actually drill down and very often even look at the underlying experimental evidence that supports that assertion." This makes for extremely large databases: In February, says Stein, WormBase was 6.4 gigabytes, growing at a rate of 1.5 GB per month.
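A compound question of this kind amounts to intersecting several curated attributes of each gene record. The sketch below is only an illustration of that idea, not MGI's schema or query interface: the `GeneRecord` fields, the `find_genes` helper, and every gene in `RECORDS` are hypothetical, invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """A hypothetical, simplified curated gene record."""
    symbol: str
    chromosome: str
    expressed_stages: frozenset  # developmental stages with expression evidence
    annotations: frozenset       # curated functional labels


# Invented records for illustration only; these are not real genes.
RECORDS = [
    GeneRecord("geneA", "2", frozenset({9, 10}), frozenset({"transcription factor"})),
    GeneRecord("geneB", "2", frozenset({12}), frozenset({"transcription factor"})),
    GeneRecord("geneC", "2", frozenset({9}), frozenset({"kinase"})),
    GeneRecord("geneD", "7", frozenset({9}), frozenset({"transcription factor"})),
]


def find_genes(records, chromosome, stage, annotation):
    """Return symbols of genes that satisfy all three criteria at once."""
    return [
        r.symbol
        for r in records
        if r.chromosome == chromosome
        and stage in r.expressed_stages
        and annotation in r.annotations
    ]


matches = find_genes(RECORDS, "2", 9, "transcription factor")
print(matches)
```

Only a record that passes all three filters is returned, which is why the question depends on curators having logged location, expression, and function for the same gene.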

Such functionality isn't cheap. The well-funded MGI database has a staff of 50, says Bult: 30 curators and 20 software engineers. But most databases have leaner budgets, and hence, fewer bells and whistles. "There are a lot of organisms in the world," says David Roos, who oversees ToxoDB and PlasmoDB, the parasite MODs; the money simply isn't there to curate every one of them. So, some MOD developers are considering automated curation procedures, like those used at EnsEMBL. "We can't rely just on our manual curation," says Dolinski, "because it will be impossible to keep up."

REINVENTING THE WHEEL As new genome sequences become available, the research communities surrounding those organisms clamor for new MODs. The NIH recognized that building new databases from scratch would waste time, money, and effort, so in 2000 it established the Generic Model Organism Database (GMOD) project. Its purpose: Extract useful features from existing databases, document them, and make them generic, portable, and publicly accessible.

The GMOD project has released several components, including a generic genome browser, a genome annotation editor and pipeline, an insertional mutagenesis database, a comparative mapping tool, a standard operating procedure editor, and a literature curation tool. Other modules are in various stages of development, says GMOD team leader Stein, such as a microarray-management system.

Focusing on reusability has already paid dividends. Toshiaki Katayama, Institute for Chemical Research, Kyoto University, Japan, recently incorporated GMOD's generic genome browser into the Kyoto Encyclopedia of Genes and Genomes (KEGG) database; the new system was up and running in just days, he says. Stein recounts a similar time-saving experience. "That's something that just would not have been possible without a flexible piece of software available that didn't make any particular assumptions about the organism," he says.

A recent conference suggests that GMOD's goals are catching on. Says Stein: "After the morning session a number of groups announced that they were no longer going to pursue projects that they've been working on and [would] instead use the projects created by other groups. Although this sounds awful ('We're abandoning our work!'), it's actually great news, because it means that we're achieving our goals of encouraging software reuse."

TO-DO LIST Maintaining the MOD is a full-time operation. Database managers need to keep the data current, says Spieth, and they need to maintain the annotations as gene-prediction algorithms improve. Incorporating new data types is another thorny issue. "I think one of the problems that all model organism databases have to face is scope," says Bult. Sometimes it makes more sense for a database to link to external sources that can handle new datasets more efficiently than to try to incorporate the data itself, she says. "Can one database really represent all of biology?" Bult asks. "I think the answer is no."

But the biggest challenge, says Spieth, is providing a way for users to make cross-species comparisons. It is relatively simple, if laborious, to link records in one database to related records in another by hand. But research would really benefit, some say, if it were possible to query multiple MODs at once. Researchers have devised several approaches to this problem.

ACROSS AN INFORMATION DIVIDE The first option is to populate a secondary database with data from primary resources. The euGenes database, containing information from many databases (human, mouse, mosquito, Drosophila, Arabidopsis, Caenorhabditis elegans, Saccharomyces cerevisiae, and zebrafish), summarizes common data types and excludes the differences, allowing the user to see a consistent interface. It is, says developer Don Gilbert from Indiana University, "a little step on the way to integrating a lot of important data."

A second option is to annotate database records using a consistent vocabulary, so that every MOD speaks the same biological language. The Gene Ontology (GO) Consortium, for example, defines terms for biological processes, cellular components, and molecular function. Avoiding the concept of orthology, which Bult describes as "a kind of a sticky wicket," the GO deals instead with function. Other lexicons also exist, such as the Plant Ontology, which defines terms for plant anatomy and growth stages.
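Why a shared vocabulary helps can be shown in a few lines: once every database labels its records with the same term, a single lookup spans all of them. This is a minimal sketch under stated assumptions. The gene identifiers and the annotation triples are invented, and the `index_by_term` helper is hypothetical; it does not reflect how the GO Consortium or any MOD actually stores annotations.

```python
from collections import defaultdict

# Hypothetical (database, gene, term) annotations, invented for illustration.
ANNOTATIONS = [
    ("FlyBase", "fly_gene_1", "DNA repair"),
    ("WormBase", "worm_gene_1", "DNA repair"),
    ("SGD", "yeast_gene_1", "DNA repair"),
    ("SGD", "yeast_gene_2", "protein folding"),
]


def index_by_term(annotations):
    """Group (database, gene) pairs under the vocabulary term they share."""
    index = defaultdict(list)
    for database, gene, term in annotations:
        index[term].append((database, gene))
    return dict(index)


index = index_by_term(ANNOTATIONS)

# One term retrieves records from every database that uses the vocabulary.
for database, gene in index["DNA repair"]:
    print(f"{database}: {gene}")
```

The lookup makes no claim that the retrieved genes are orthologs; it says only that curators described them with the same functional term.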

A third approach uses defined protocols for data sharing among information sources. Stein's distributed annotation system (DAS), for instance, "allows a single machine to gather up genome annotation information from multiple distant Web sites, collate the information, and display it to the user in a single view," explains the DAS Web site. Several MODs implement DAS, including the Whitehead Institute's OmniGene system.
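The gather-and-collate idea can be sketched without any networking. The code below is not the DAS protocol or its message format: plain Python functions stand in for remote annotation servers, and the `Feature` type, the two source functions, and all coordinates are hypothetical. It shows only the merging step, in which annotations for one genomic region are pulled from several sources and ordered into a single view.

```python
from typing import Callable, List, NamedTuple


class Feature(NamedTuple):
    """A hypothetical annotation on a stretch of sequence."""
    source: str
    start: int
    end: int
    label: str


def overlapping(features: List[Feature], start: int, end: int) -> List[Feature]:
    """Keep only the features that overlap the requested region."""
    return [f for f in features if f.start <= end and f.end >= start]


# Stand-ins for remote servers; a real client would fetch these over the Web.
def gene_source(start: int, end: int) -> List[Feature]:
    features = [
        Feature("gene_source", 100, 900, "gene model"),
        Feature("gene_source", 1500, 2400, "gene model"),
    ]
    return overlapping(features, start, end)


def expression_source(start: int, end: int) -> List[Feature]:
    features = [Feature("expression_source", 300, 450, "expressed tag")]
    return overlapping(features, start, end)


def collate(sources: List[Callable[[int, int], List[Feature]]],
            start: int, end: int) -> List[Feature]:
    """Query every source for the region and merge results by position."""
    merged: List[Feature] = []
    for fetch in sources:
        merged.extend(fetch(start, end))
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


view = collate([gene_source, expression_source], 0, 1000)
for f in view:
    print(f"{f.start}-{f.end}\t{f.label}\t({f.source})")
```

Because each source answers the same kind of request in the same shape, adding another source means adding one more function to the list rather than rewriting the viewer.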

OmniGene creator Brian Gilman explains: "Every MOD provider uses a different paradigm for producing [its] data. They also produce different interfaces ... for their end customer." Therefore, biologists must navigate different Web pages, each with its own idiosyncrasies, to make cross-species comparisons. "OmniGene ... allows you to work within a framework to get the data out of these databases ... and get a common interface to all of them."

Such projects are unlikely to diminish the role of traditional MODs, says BioMOBY creator Mark Wilkinson. "I'm sure they'll still be accessed through the traditional CGI [common gateway interface]." Systems like BioMOBY and OmniGene, he says, "give people a way to get into those databases, do ad hoc mass queries ... without ever having to cut and paste."

Jeffrey M. Perkel can be contacted at jperkel@the-scientist.com.

References
1. "Report from the NIH/NIAID/Wellcome Trust Workshop on Model Organism Databases," Bethesda, Maryland, April 29-30, 2002; available online at www.gmod.org/GMOD_Report.shtml

2. S. Misra et al., "Annotation of the Drosophila melanogaster euchromatic genome: A systematic review," Genome Biol, 3:research0083.1-22, Dec. 31, 2002; available online at genomebiology.com/2002/3/12/research/0083