I slipped into bioinformatics through the back door. In 1992, while completing my pathology residency at Brigham and Women's Hospital in Boston, I became frustrated with my inability to combine my love of biological research with my hobby of tinkering with computers and developing software. So when I was offered the opportunity to spend a summer working with the software group at the Whitehead Institute/MIT Center for Genome Research, I jumped at it.
I joined the field of bioinformatics just a year after the term first appeared in the published literature (though people had been doing computational sequence analysis for many years by then), and I have watched it grow from an esoteric research niche to a core discipline that biologists cannot conceivably do without.
The early bioinformatics databases emphasized primary data capture. GenBank, established in the early 1980s, began with staff at Los Alamos National Laboratory (and later the...
Though MODs are still going strong, their preeminence is now being challenged by multispecies, comparative-genomics databases, sometimes called clade-specific databases. These systems integrate information on multiple organisms and use comparative sequence analysis to discover patterns in the genome that might otherwise be missed. Well-known clade-specific databases include EnsEMBL at the European Bioinformatics Institute (EBI;
Some years ago my laboratory established Gramene
The maize genome, for instance, is about the same length as the human genome and won't be fully sequenced for several more years, but rice, with a compact genome one-tenth the size of the human genome, already is. Because the two grains are closely related evolutionarily, we have been able to create maps that relate maize's genetic map to the rice genome sequence. This allows researchers to follow a genetically mapped trait in maize, such as tolerance to high salt levels in the soil, and move into the corresponding region in the rice genome, thereby identifying candidate genes for salt tolerance. Similar techniques helped cattle researchers identify in 1997 a gene responsible for muscle growth based on the existence of a genetic mutation in a corresponding region of the mouse genome.1
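The cross-species lookup described here can be sketched in a few lines of Python. This is only an illustration of the idea, not Gramene's actual data model: the `SyntenyBlock` and `Gene` records, the chromosome names, the coordinates, and the gene identifiers are all hypothetical. For simplicity the sketch returns every gene in a matching rice block rather than interpolating positions within the block.

```python
from typing import List, NamedTuple


class SyntenyBlock(NamedTuple):
    """Hypothetical record tying a maize genetic interval to a rice region."""
    maize_chr: str
    maize_start_cm: float   # genetic map position, centimorgans
    maize_end_cm: float
    rice_chr: str
    rice_start_bp: int      # sequence position, base pairs
    rice_end_bp: int


class Gene(NamedTuple):
    """Hypothetical annotated rice gene."""
    gene_id: str
    chrom: str
    start_bp: int
    end_bp: int


def overlaps(a_start, a_end, b_start, b_end) -> bool:
    """True when two closed intervals share at least one point."""
    return a_start <= b_end and b_start <= a_end


def candidate_genes(blocks: List[SyntenyBlock], genes: List[Gene],
                    maize_chr: str, start_cm: float, end_cm: float) -> List[str]:
    """Return rice gene IDs lying in blocks that correspond to a maize interval."""
    hits: List[str] = []
    for block in blocks:
        if block.maize_chr != maize_chr:
            continue
        if not overlaps(block.maize_start_cm, block.maize_end_cm, start_cm, end_cm):
            continue
        for gene in genes:
            in_block = gene.chrom == block.rice_chr and overlaps(
                gene.start_bp, gene.end_bp, block.rice_start_bp, block.rice_end_bp)
            if in_block and gene.gene_id not in hits:
                hits.append(gene.gene_id)
    return hits


# Toy data, invented for illustration only.
blocks = [
    SyntenyBlock("maize_1", 10.0, 30.0, "rice_3", 1_000_000, 2_000_000),
    SyntenyBlock("maize_1", 50.0, 70.0, "rice_5", 500_000, 900_000),
]
genes = [
    Gene("gene_a", "rice_3", 1_200_000, 1_205_000),
    Gene("gene_b", "rice_3", 2_500_000, 2_503_000),
    Gene("gene_c", "rice_5", 600_000, 604_000),
]

# A trait mapped to 20-25 cM on the hypothetical maize chromosome.
print(candidate_genes(blocks, genes, "maize_1", 20.0, 25.0))
```

A real comparative map would carry many more blocks, handle inversions and duplications, and narrow the rice region in proportion to the size of the maize interval.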
As this trend continues, I predict that all MODs will either become clade-specific or wither away. We can already see this trend in action. FlyBase
THE PATH TO PATHWAYS
The next big thing will be pathway databases. Traditional bioinformatics databases are linear catalogs of sequences, genes, proteins, genomes, and genome-to-genome alignments. Such databases have one or a small number of central data objects, such as a gene record, and all the other information hangs off that object.
Essential as this catalog functionality is, it is a far cry from what biological researchers ultimately want. A typical research talk recounts the story of an unfolding series of experiments, followed in the very last slide by "The Model," a pictorial summary of what the scientist thinks the experiments mean. The model describes the series of molecular events (the pathway) that is responsible for whatever phenomenon the researcher is studying, whether it be embryonic development, neuronal signaling in the brain, or the transformation of healthy cells into cancerous ones. These pathways are the ultimate output of biological research, the knowledge that gets published in papers, digested into textbooks, and used as the foundation for new generations of research projects.
Most biological pathways never make it into electronic databases, but instead languish in an increasingly crowded scientific literature. Researchers learn about their peers' findings from the literature and by attending meetings, but are typically able to keep up with only the literature in their own subspecialty. It's ironic, really: Just as comparative genomics is making it easier to connect genes that are traditionally studied in one model system with genes that are studied in another, researchers are struggling to access knowledge from outside their own discipline.
A new generation of pathway databases will address this gap. My current project, in collaboration with researchers at EBI and UC-Berkeley, is Reactome
Each pathway is linked to the literature references that provide experimental support for it, and to the database records for the genes, sequences, and proteins that participate in it. A researcher can enter Reactome either at the pathway level, drilling down to get a list of the chemical reactions and protein interactions that make up the pathway, or by searching for a particular protein and discovering each of the pathways in which that protein participates. Leveraging the power of comparative genomics, Reactome includes automatically generated information on pathways in mouse, rat, chicken, and fish.
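The two ways into the data described above, pathway first or protein first, amount to keeping a forward index and a reverse index over the same records. The following Python sketch shows that idea under stated assumptions: the `PathwayIndex` class, its method names, and the placeholder pathway, reaction, and protein names are all hypothetical and are not Reactome's schema or API.

```python
from collections import defaultdict


class PathwayIndex:
    """Toy two-way index: pathway -> reactions, and protein -> pathways."""

    def __init__(self):
        # pathway name -> ordered list of (reaction, participating proteins)
        self._reactions = {}
        # protein name -> set of pathway names it takes part in
        self._by_protein = defaultdict(set)

    def add_reaction(self, pathway, reaction, proteins):
        """Record one reaction and update the reverse protein index."""
        self._reactions.setdefault(pathway, []).append((reaction, tuple(proteins)))
        for protein in proteins:
            self._by_protein[protein].add(pathway)

    def drill_down(self, pathway):
        """Pathway-first entry: list the reactions that make up a pathway."""
        return [reaction for reaction, _ in self._reactions.get(pathway, [])]

    def pathways_for(self, protein):
        """Protein-first entry: list every pathway the protein participates in."""
        return sorted(self._by_protein.get(protein, ()))


# Placeholder names, invented for illustration only.
idx = PathwayIndex()
idx.add_reaction("pathway_A", "reaction_1", ["protein_X", "protein_Y"])
idx.add_reaction("pathway_A", "reaction_2", ["protein_Y"])
idx.add_reaction("pathway_B", "reaction_3", ["protein_X"])

print(idx.drill_down("pathway_A"))
print(idx.pathways_for("protein_X"))
```

A production pathway database would also attach literature references and cross-references to gene and sequence records at each level, which this sketch leaves out.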
For bench researchers, Reactome acts as a sort of biological
Reactome is not alone; other pathway databases exist, including HumanCyc at Stanford Research Institute
Lincoln Stein is a professor of bioinformatics at Cold Spring Harbor Laboratory, New York, where he develops biological databases and the tools to run them. He is also an avid Web developer and author of several books on Internet software development, including
He can be contacted at