What's Next for Bioinformatics?


Lincoln Stein (lstein@cshl.edu)
May 22, 2005

I slipped into bioinformatics through the back door. In 1992, while completing my pathology residency at Brigham and Women's Hospital in Boston, I became frustrated with my inability to combine my love of biological research with my hobby of tinkering with computers and developing software. So when I was offered the opportunity to spend a summer working with the software group at the Whitehead Institute/MIT Center for Genome Research, I jumped at it.

I joined the field of bioinformatics just a year after the term first appeared in the published literature (though people had been doing computational sequence analysis for many years by then), and I have watched it grow from an esoteric research niche to a core discipline that biologists cannot conceivably do without.

The early bioinformatics databases emphasized primary data capture. GenBank, established in the early 1980s, began with staff at Los Alamos National Laboratory (and later the...



The raison d'etre of a model organism database (MOD) is to capture and integrate key facts about the biology of a species in browsable form. Practically every model organism has one. After leaving Boston, I started work at Cold Spring Harbor, where I helped establish WormBase, the Caenorhabditis elegans MOD.

WormBase (http://www.wormbase.org) illustrates the way MODs integrate different kinds of biological data. Each of the organism's 20,000-odd genes has a Web page that describes its sequence, genomic position, intron/exon structure, known mutants and resulting phenotypes, common polymorphisms, and tissue- and stage-specific expression patterns. For a gene's protein product, WormBase identifies its amino acid sequence, functional domains and subcellular location, and known and predicted protein-protein interactions. At the bottom of the page, researchers will find links to stock centers and laboratories where they can obtain relevant reagents such as strains carrying mutations in the gene, as well as a chronological listing of all published papers with findings relevant to the gene's structure or function.

Though MODs are still going strong, their preeminence is now being challenged by multispecies, comparative-genomics databases, sometimes called clade-specific databases. These systems integrate information on multiple organisms and use comparative sequence analysis to discover patterns in the genome that might otherwise be missed. Well-known clade-specific databases include EnsEMBL at the European Bioinformatics Institute (EBI; http://www.ensembl.org), Entrez at the NCBI (http://www.ncbi.nlm.nih.gov), and the Genome Browser at the University of California, Santa Cruz (http://www.genome.ucsc.edu), all of which relate information on the human genome to data gathered from other vertebrates, invertebrates, prokaryotes, and plants.

Some years ago my laboratory established Gramene (http://www.gramene.org), a comparative genomics resource for crop grasses. This database integrates genome sequences, genetic maps, and mutation and trait data across rice, maize, wheat, and a large number of other cereals. Gramene gives researchers the benefit of genome sequencing even before their favorite organism has actually been sequenced.

The maize genome, for instance, is about the same length as the human genome and won't be fully sequenced for several more years, but rice, with a compact genome one-tenth the size of the human genome, already has been. Because the two grains are closely related evolutionarily, we have been able to create maps that relate maize's genetic map to the rice genome sequence. This allows researchers to follow a genetically mapped trait in maize, such as tolerance to high salt levels in the soil, and move into the corresponding region in the rice genome, thereby identifying candidate genes for salt tolerance. Similar techniques helped cattle researchers identify in 1997 a gene responsible for muscle growth based on the existence of a genetic mutation in a corresponding region of the mouse genome.1
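The move from a mapped maize trait to rice candidate genes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gramene's actual implementation: every marker name, map position, and gene identifier below is invented, and the helpers `syntenic_region` and `candidate_genes` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    """A marker placed on both maps (all values here are made up)."""
    marker: str
    maize_cm: float   # position on a maize genetic map, in centimorgans
    rice_chrom: str   # rice chromosome carrying the matching sequence
    rice_bp: int      # position on the rice genome sequence, in base pairs

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int

# Hypothetical comparative-map anchors and rice gene annotations.
ANCHORS = [
    Anchor("m1", 10.0, "chr1", 1_000_000),
    Anchor("m2", 25.0, "chr1", 2_500_000),
    Anchor("m3", 40.0, "chr1", 4_200_000),
    Anchor("m4", 60.0, "chr5", 800_000),
]
RICE_GENES = [
    Gene("geneA", "chr1", 900_000, 905_000),
    Gene("geneB", "chr1", 2_600_000, 2_604_000),
    Gene("geneC", "chr1", 4_100_000, 4_108_000),
    Gene("geneD", "chr5", 810_000, 812_000),
]

def syntenic_region(anchors, cm_start, cm_end):
    """Return (chrom, start_bp, end_bp) spanned in rice by the anchors that
    fall inside a maize genetic interval, or None if the interval has no
    anchors or its anchors land on more than one rice chromosome."""
    hits = [a for a in anchors if cm_start <= a.maize_cm <= cm_end]
    chroms = {a.rice_chrom for a in hits}
    if len(chroms) != 1:
        return None
    positions = [a.rice_bp for a in hits]
    return chroms.pop(), min(positions), max(positions)

def candidate_genes(genes, region):
    """List the names of genes that overlap the rice region."""
    chrom, start, end = region
    return [g.name for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]

# A trait mapped between 20 and 45 cM on the (hypothetical) maize map.
region = syntenic_region(ANCHORS, 20.0, 45.0)
candidates = candidate_genes(RICE_GENES, region)
```

A real comparative map has to cope with rearrangements, duplicated regions, and uncertain marker order, which is why this sketch simply gives up when the anchors disagree about the rice chromosome.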

As this trend continues, I predict that all MODs will either become clade-specific or wither away. We can already see this trend in action. FlyBase (http://www.flybase.org) now contains two genomes: Drosophila melanogaster and D. pseudoobscura. WormBase contains information on both C. elegans and C. briggsae and is adding information about three more Caenorhabditis species over the next year.


The next big thing will be pathway databases. Traditional bioinformatics databases are linear catalogs of sequences, genes, proteins, genomes, and genome-to-genome alignments. Such databases have one or a small number of central data objects, such as a gene record, and all the other information hangs off that object.

Essential as this catalog functionality is, it is a far cry from what biological researchers ultimately want. A typical research talk recounts the story of an unfolding series of experiments, followed in the very last slide by "The Model," a pictorial summary of what the scientist thinks the experiments mean. The model describes the series of molecular events (the pathway) that is responsible for whatever phenomenon the researcher is studying, whether it be embryonic development, neuronal signaling in the brain, or the transformation of healthy cells into cancerous ones. These pathways are the ultimate output of biological research, the knowledge that gets published in papers, digested into textbooks, and used as the foundation for new generations of research projects.

Most biological pathways never make it into electronic databases, but instead languish in an increasingly crowded scientific literature. Researchers learn about their peers' findings from the literature and by attending meetings, but are typically able to keep up with only the literature in their own subspecialty. It's ironic, really: Just as comparative genomics is making it easier to make connections between genes that are traditionally studied in one model system and genes that are studied in another, researchers are struggling to access knowledge from outside their own discipline.

A new generation of pathway databases will address this gap. My current project, in collaboration with researchers at EBI and UC-Berkeley, is Reactome (http://www.reactome.org), a curated collection of human biological pathways. Reactome is built by a group of curators and invited expert biologists who read through the scientific literature by hand and create database records describing fundamental pathways of the human organism. Current entries describe energy metabolism, DNA replication, RNA transcription and splicing, protein translation, and cell cycle regulation, as well as more specialized pathways such as the blood-clotting cascade.

Each pathway is linked to the literature references that provide experimental support for it, and to the database records for the genes, sequences, and proteins that participate in it. A researcher can enter Reactome either at the pathway level, drilling down to a list of the chemical reactions and protein interactions that make up the pathway, or by searching for a particular protein and discovering each of the pathways in which it participates. Leveraging the power of comparative genomics, Reactome also includes automatically generated information on pathways in mouse, rat, chicken, and fish.
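The two ways of entering a pathway database described above amount to a nested record plus an inverted index. The Python sketch below illustrates that idea only; it is not Reactome's data model or API, and the pathway, reaction, and protein names, along with the functions `reactions_in`, `build_protein_index`, and `pathways_for`, are hypothetical.

```python
from collections import defaultdict

# Hypothetical, heavily simplified records: pathway -> reactions -> proteins.
PATHWAYS = {
    "pathway_X": {
        "reaction_1": ["protA", "protB"],
        "reaction_2": ["protB", "protC"],
    },
    "pathway_Y": {
        "reaction_3": ["protC", "protD"],
    },
}

def reactions_in(pathways, pathway):
    """Drill down from a pathway to its reactions and their participants."""
    return pathways[pathway]

def build_protein_index(pathways):
    """Invert the records so each protein maps to the pathways it appears in."""
    index = defaultdict(set)
    for pathway, reactions in pathways.items():
        for participants in reactions.values():
            for protein in participants:
                index[protein].add(pathway)
    return index

PROTEIN_INDEX = build_protein_index(PATHWAYS)

def pathways_for(protein):
    """Return every pathway a protein participates in, sorted for display."""
    return sorted(PROTEIN_INDEX.get(protein, set()))

# Entry point 1: start from a pathway and list its reactions.
reactions = reactions_in(PATHWAYS, "pathway_X")

# Entry point 2: start from a protein and find its pathways.
shared = pathways_for("protC")
```

Building the index once keeps protein lookups cheap even when the same protein turns up in many reactions.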

For bench researchers, Reactome acts as a sort of biological CliffsNotes. A researcher who has discovered that an unfamiliar gene participates in the system of interest can visit Reactome to obtain a quick description, along with diagrams and literature references, explaining the gene's role in various biological processes. More importantly, however, we designed Reactome to be a tool for researchers to ask questions about the organization of biological systems. As Reactome grows, we anticipate researchers will use it to measure the conservation of pathways across evolutionary time; look for correlations in the chromosomal positions of genes involved in common pathways; estimate the effects of a genetic mutation on a biological pathway; and identify feedback loops in genetically regulated pathways.
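As one example of the kind of question listed above, finding a feedback loop reduces to finding a cycle in a directed graph once regulatory relationships are stored as "regulator acts on target" pairs. The following Python sketch uses a standard depth-first search; the gene names and edges are invented, `find_feedback_loop` is a hypothetical helper, and nothing here reflects how Reactome itself stores or analyzes pathways.

```python
def find_feedback_loop(edges):
    """Return one cycle as a list of nodes (the first node is repeated at
    the end), or None if the directed graph contains no cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # A back edge to a node on the current path closes a loop.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            loop = visit(node)
            if loop:
                return loop
    return None

# Hypothetical regulatory edges: (regulator, target).
EDGES = [
    ("geneA", "geneB"),
    ("geneB", "geneC"),
    ("geneC", "geneA"),   # geneC acts back on geneA, closing the loop
    ("geneC", "geneD"),
]
loop = find_feedback_loop(EDGES)
```

The recursive search is fine for small illustrative graphs; a genome-scale network would call for an iterative traversal and for reporting every loop rather than just the first one found.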

Reactome is not alone; other pathway databases exist, including HumanCyc at Stanford Research Institute (http://www.humancyc.org), PANTHER Pathways at Applied Biosystems (http://panther.appliedbiosystems.com), and BIND at the University of Toronto (http://www.bind.ca). Just as GenBank moved rapidly from manual curation to fully automatic electronic submission, I expect that Reactome and other pathway databases will begin to accept electronic submissions over the next few years, possibly directly from authors or in collaboration with scientific journals. Indeed, given the convergent trend of scientific journals to publish online, we may even see a merging of the two efforts. Before the end of the next decade, pathway databases will become scientific journals and journals will become databases. Biologists will be greatly empowered, and bioinformatics will continue its long evolution.

Lincoln Stein is a professor of bioinformatics at Cold Spring Harbor Laboratory, New York, where he develops biological databases and the tools to run them. He is also an avid Web developer and author of several books on Internet software development, including Network Programming in Perl and How to Set Up and Maintain a Web Site.

He can be contacted at lstein@cshl.edu.
