Your Database Is Talking; Is Anybody Listening?

Amy Adams (aadams@the-scientist.com)
Sep 11, 2005

During most of the 1990s, a linguistic chasm divided the worlds of flies, worms, mice, and other model organisms. People in one world remained largely ignorant about related genes and proteins being studied in the others, in part because each group stored data using its own peculiar vocabulary. Even within a single organism, a search for genes involved in "translation" might not pull up those described using the term "protein synthesis," and vice versa.

Michael Ashburner, a fly geneticist at Cambridge University, thought what the genetics field needed was a universal language to bring the data together. "It seemed to me self-evident that if all model organism databases used common language for describing gene products, then we'd be able to have some unification," he says.

His idea finally took hold, and in 1998 resulted in what is now the most widely used structured language, or ontology, to describe the biological world: the Gene Ontology (GO). Although the GO originally encompassed only fly, mouse, and yeast data, it is now broadly used by databases for most model organisms.

Seven years later, Ashburner is helping to guide the burgeoning field through growing pains brought on by the success of the original ontology. About 50 ontologies now make up the Open Biomedical Ontologies (OBO), a collection also administered by Ashburner, and together they provide a formal way of describing everything from human disease to animal natural history. Each of these languages is internally consistent, but they are not necessarily consistent with one another. What is needed is a way for these disparate ontologies to talk to each other: a biological lingua franca.

AN ONTOLOGY PRIMER

Like human languages, each ontology has a slightly different structure. But the GO is a good model for how these languages are built. Each term in the ontology has an identification number and a definition. For example, the term "aging" has the identifier GO:0016280 and the definition: "The inherent decline over time, from the optimal fertility and viability of early maturity that culminates in death and may be preceded by other indications such as sterility."

A term's entry may also include synonyms, allowing a search for "translation" to pull up entries using the term "protein synthesis" instead. Each term is assigned to one of three categories: "cellular_component," "molecular_function," or "biological_process." Aging is defined as a biological_process. Each term can also be related to other terms through the relations is_a and part_of. Thus, Aging is_a Development (GO:0007275) and is part_of Death (GO:0016265).
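The structure described above can be sketched in a few lines of Python. This is only an illustration, not the GO's actual file format or software: the field names and the search function are assumptions, and the identifier given to "translation" is a made-up placeholder. The identifiers for aging, development, and death are the ones quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term. Field names are illustrative, not GO's file format."""
    id: str
    name: str
    namespace: str                                  # e.g. "biological_process"
    synonyms: list = field(default_factory=list)
    is_a: list = field(default_factory=list)        # IDs of parent terms
    part_of: list = field(default_factory=list)     # IDs of containing terms


# IDs for aging, development, and death are those quoted in the article.
# "EX:0000001" is a hypothetical placeholder, not a real GO identifier.
TERMS = [
    Term("GO:0007275", "development", "biological_process"),
    Term("GO:0016265", "death", "biological_process"),
    Term("GO:0016280", "aging", "biological_process",
         is_a=["GO:0007275"], part_of=["GO:0016265"]),
    Term("EX:0000001", "translation", "biological_process",
         synonyms=["protein synthesis"]),
]


def search(query, terms=TERMS):
    """Return terms whose name or any synonym equals the query, ignoring case."""
    q = query.strip().lower()
    return [t for t in terms
            if q == t.name.lower() or q in (s.lower() for s in t.synonyms)]
```

With synonyms stored on the term itself, a search for "protein synthesis" and a search for "translation" both land on the same entry, which is the behavior the separate vocabularies of the 1990s could not offer.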

Other ontologies follow a similar approach, with some exceptions in the relational terms. The Mouse Anatomy Ontology uses only part_of, whereas the Drosophila Anatomy Ontology uses both relations from the GO plus the additional develops_from.

The recently released Sequence Ontology (SO),1 assembled in part by Ashburner and Chris Mungall, a University of California, Berkeley, researcher who has helped develop several structured languages, takes the approach one step further. It uses the terms difference and overlap to ask questions about the part_of relationship. The SO's primary goal was to unify gene annotations from different genomes so that a single set of tools could search and display results from any database. Among the inconsistencies it cleared up is the placement of the stop codon, which may be part of the coding sequence in one annotation but part of the 3' untranslated region in another.

The additional terms allow the ontology to express relationships between the parts of a whole. For example, two different transcripts may be defined as part_of the same gene. If the gene has three exons, then an exon found in both transcripts would overlap and be part_of both transcripts. Exons found in only one or the other transcript would be part_of that transcript and the gene, and would be a difference between the two transcripts. This additional information, Ashburner says, makes it possible to mine even more information from a database through the SO.
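The three-exon example maps naturally onto set operations. The sketch below is an illustration only and does not reflect how SO tools are implemented: the exon and transcript names are invented, and overlap and difference are modeled here as plain set intersection and symmetric difference.

```python
# Hypothetical gene with three exons and two transcripts; all names are invented.
gene_exons = {"exon1", "exon2", "exon3"}
transcripts = {
    "transcript_A": {"exon1", "exon2"},
    "transcript_B": {"exon2", "exon3"},
}


def overlap(a, b):
    """Exons that are part_of both transcripts."""
    return transcripts[a] & transcripts[b]


def difference(a, b):
    """Exons that are part_of one transcript or the other, but not both."""
    return transcripts[a] ^ transcripts[b]


def part_of_gene(exon):
    """True if the exon belongs to the gene."""
    return exon in gene_exons
```

Here exon2 is the overlap, while exon1 and exon3 make up the difference; all three remain part_of the gene.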

ONTOLOGIES IN THE LAB

The end result of this linguistic tinkering is something that remains largely invisible to the bench scientist. He or she simply knows that by entering a search term such as Aging into FlyBase, the database will retrieve all gene products that are defined as being involved in that process. And all the terms are in what appears to be plain English. In fact, inclusion in the OBO requires that the terms and their associated definitions be clear to the reader.

"I think we've found a middle ground that looks familiar to biologists, but is more systematized," says Midori Harris, GO editor at the Wellcome Trust Genome Campus in Cambridge, UK.

Other widely available tools, particularly those for gene expression analysis, make use of GO terminology. For any gene that is up- or down-regulated in a given sample, associated GO information adds context about what that gene's product does in a cell. Likewise, it's possible to find GO terms in common between clusters of differentially regulated genes, indicating that the genes in a particular group are all involved in the cell cycle or are all expressed in a given cellular compartment.
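Finding the terms a cluster has in common amounts to intersecting each gene's set of annotations. The following is a minimal sketch under invented data: the gene names and their term assignments are hypothetical, and real expression tools typically go further by testing whether a term is statistically over-represented.

```python
# Hypothetical annotations: gene names and term assignments are invented.
annotations = {
    "geneA": {"cell cycle", "nucleus"},
    "geneB": {"cell cycle", "cytoplasm"},
    "geneC": {"cell cycle", "nucleus"},
    "geneD": {"translation", "cytoplasm"},
}


def shared_terms(cluster):
    """Return the terms annotated to every gene in the cluster."""
    term_sets = [annotations[gene] for gene in cluster]
    return set.intersection(*term_sets) if term_sets else set()
```

For a cluster of geneA, geneB, and geneC, the only shared term is "cell cycle," which hints at what the co-regulated genes are doing together.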

For now, GO is the only ontology in widespread use. But Ashburner expects the recently released Sequence Ontology and Cell Ontology will be integrated into tools to broaden their applications. By combining ontologies in this way, a scientist could, in theory, identify a gene with an interesting expression pattern, then follow that lead to related genes in other organisms via the Sequence Ontology, to cellular pathways in the Biochemical Ontology, and to cell type definitions within the Cell Ontology.

REWRITING THE DICTIONARY

For the moment, however, ontologies don't work together as seamlessly as they could. Barry Smith of the University at Buffalo points out, for example, that the GO defines Menopause as part_of Aging, and Aging as part_of Death. "If A is part of B and B is part of C, then menopause is part of death," he says. That conclusion is false: certain diseases become more common after menopause, but the process itself isn't lethal.
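Smith's objection is easy to reproduce mechanically. The sketch below assumes part_of is treated as transitive and follows the two assertions mentioned above; the dictionary and function names are illustrative, not part of any GO software.

```python
# The two part_of assertions Smith cites, written as child -> list of wholes.
PART_OF = {
    "menopause": ["aging"],
    "aging": ["death"],
}


def all_wholes(term, edges=PART_OF):
    """Return every term reachable by following part_of edges transitively."""
    found = []
    stack = list(edges.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(edges.get(current, []))
    return found
```

A reasoner that follows the chain from "menopause" arrives at "death," even though no annotator ever wrote that statement down. The error lies in the individual assertions, not in the inference.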

Another problem is that relational terms have slightly different meanings in each ontology, which can make it impossible to draw logical conclusions across two or more of them. The root cause, according to Smith, is that annotators don't use terms such as part_of and is_a consistently. "One goal of ontologies is to link different sets of data about proteins, diseases, cell pathways, and so on. If each ontology uses the relations to mean slightly different things, then you can't link them together," Smith says.

What's needed to bring the fields together, Smith says, is a consistent structure for all of the biological ontologies. Smith and a group including Mungall described such a unification plan in a recent Genome Biology paper.2 This Relational Ontology (RO) includes 10 relations, each of which is rigorously defined: is_a, part_of, located_in, contained_in, adjacent_to, transformation_of, derives_from, preceded_by, has_participant, and has_agent. Ontologies already in the OBO are not required to rewrite their terms, though the GO is slowly beginning to update its own. All new ontologies, however, must conform to the RO as a prerequisite for inclusion in the OBO, Ashburner says.
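One simple way to picture conformance is a check that an ontology uses only relation names drawn from the ten listed above. This is a hypothetical sketch, not the OBO's actual admission procedure, and it checks names only; the RO's real contribution is the rigorous definition behind each name.

```python
# The ten relations named in the Relational Ontology, as listed in the article.
RO_RELATIONS = frozenset({
    "is_a", "part_of", "located_in", "contained_in", "adjacent_to",
    "transformation_of", "derives_from", "preceded_by",
    "has_participant", "has_agent",
})


def nonconforming(relations_used):
    """Return, sorted, the relation names that are not among the ten RO relations."""
    return sorted(set(relations_used) - RO_RELATIONS)
```

Under this check, an ontology that uses is_a, part_of, and develops_from would have develops_from flagged for review.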

For the bench biologist, the conversion will mean applications that allow searches for, say, all dicistronic genes in mice, flies, and worms that are expressed in a given cell type – a search that would currently be impossible, says Ashburner. This potential for more powerful applications is what convinced both the GO and OBO to adopt the Relational Ontology.

In early June, the GO held an annotation camp for experienced annotators and people who have an interest in submitting new terms. Jim Zheng, from the Medical University of South Carolina, attended the camp to become better acquainted with the ontology for his own work designing an object-oriented computer language based on GO. This work could also apply to other ontologies and would allow programmers to design more complex applications modeling biological processes. He explains it could take a while for annotators to incorporate the new definitions in their work. "It is a very large task to revamp a finished ontology, but we have to do it, and it will be an ongoing process for a very long time," he says.

Jane Lomax, a GO editor at UC, Berkeley, who helped lead the camp, says bringing the GO in line with the Relational Ontology will be a long-term project. According to July 2005 statistics, the GO includes 19,247 terms: 9,960 biological_processes, 1,694 cellular_components, and 7,563 molecular_functions. But Mungall provided some help for ontology curators by developing Obol.3 Obol provides a way to parse ontologies, looking for inconsistencies, redundancies, and hidden relationships. "This work should allow us to more easily detect and fix inconsistencies in the GO – and obviously the more consistently the relations are used in GO, the more effective the reasoning can be," Lomax says.

Though most bench scientists will likely never know the changes that have transpired under the hoods of their favorite databases, the results could be big news: a closing of the chasm that once separated people working in different fields. "I think all bench biologists should be interested in efficiently discovering what is already known," says Ashburner.