© 2003 Nature Publishing Group
Aligned nucleotides of the spurious ORF, YDR102C, are shown as stacked squares for the four species compared (
Simple, fast-growing, and sexually reproducing, yeast have been a stalwart model for generations of geneticists. The first eukaryote to be sequenced, nearly a decade ago, and amenable to high-throughput techniques, they have also led the charge in genomics. Two papers published in 2003 staked a new claim in computational biology, marking the first time multiple eukaryotic genomes were aligned completely. Doing so set a new standard in comparative sequence analysis, with inevitable implications for annotating the human genome and peering into evolutionary history.
This issue's Hot Papers showcased the enormous power afforded by comparative sequence analysis. A group from genome-sequencing centers in Boston developed new computational techniques to deal with the task of aligning four whole genomes in search of hidden regulatory elements.1 Another group from the sequencing center at Washington University in St. Louis applied conventional computational techniques to a larger sample of genomes.2 While the concept isn't new, the approaches used in these papers elegantly show how much high-precision annotation can be achieved with comparative genomics. These papers "are early indicators of how powerful these methods are," says Maynard Olson, genome scientist at the University of Washington, Seattle.
In the first paper,1 Manolis Kellis and Eric Lander, along with colleagues at the Whitehead Institute, MIT, and Harvard University in Boston, sought to identify regulatory motifs – binding sites for transcriptional regulators – among four species of
Kellis, a computational biologist, then devised a method for accurately identifying both genes and regulatory motifs. The team identified 72 regulatory elements including several new motifs. Their results revealed the yeast genome to contain approximately 500 fewer genes than originally thought – paring down the count by roughly 15%.
"Motif conservation at the genome-wide level had not been asked before; therefore, we were unable to use available programs," says Kellis. "I wrote everything from scratch; I had to develop programs to figure out how to align the genomes." Doing so revealed astonishing detail, recalls Kellis. "To me, the motifs just stood out. The level of refinement and resolution that we had was just unprecedented," he says.
Data derived from the Science Watch/Hot Papers database and the Web of Science (Thomson Scientific, Philadelphia) show that Hot Papers are cited 50 to 100 times more often than the average paper of the same type and age.
"Sequencing and comparison of yeast species to identify genes and regulatory elements," Kellis M, Nature, 2003 Vol 423, 241-54 (Cited in 241 papers, Hist Cite Analysis)
"Finding functional features in
The computational methods had to treat genomes as complete entities rather than compare sequences as fragments. In addition to being able to more accurately identify genes, "we were able to define control elements of the dynamic cell and how genomes are changing across species," explains Kellis.
"It is the short sequence that is the main target of this analysis," says Olson. "The real push has been in regulatory sequences. There are many fewer constraints and they are much harder to recognize. They don't have a [particular] genetic code as a guide; often they're very short, not always perfectly conserved, and their position doesn't matter much. All these things make them harder to recognize."
Regulatory sequences commonly run six to 10 nucleotides; longer regulatory sequences tend to contain a collection of these short sequences. "In the last three to four years, there has been a real explosion in sequences of organisms related to genomes already sequenced," says Steve Salzberg, from The Institute for Genomic Research (TIGR), who wrote the accompanying commentary.3
What makes their work stand out, according to Salzberg, is that it was a different question from anything anyone had asked before, and it was the first time someone had gone and sequenced three genomes at once and aligned them to a fourth. "You can see the nonprotein coding regions; they're conserved in a way that makes it clear that it's not by chance and that there is some evolutionary constraint on that sequence," says Salzberg. "Experimentalists can learn a tremendous amount by studying evolution's laboratory notes," Lander adds.
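The idea that conserved noncoding stretches stand out in a multi-species alignment can be illustrated with a minimal Python sketch. It is not the method used in either Hot Paper: it assumes the sequences are already aligned, uses made-up toy sequences, and flags only runs of columns that are identical in every species, whereas real regulatory motifs, as Olson notes, are not always perfectly conserved. The function name `conserved_windows` and the `min_len` default of 6 (the low end of the six-to-10-nucleotide range quoted above) are illustrative choices.

```python
def conserved_windows(alignment, min_len=6):
    """Find runs of gap-free columns that are identical in every aligned row.

    `alignment` is a list of equal-length strings, one per species, with
    '-' marking gaps. Returns (start, end, sequence) tuples, end exclusive,
    for every run at least `min_len` columns long.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    rows = [row.upper() for row in alignment]
    runs = []
    start = None
    # Iterate one past the end so a run touching the last column is closed.
    for i in range(width + 1):
        conserved = (
            i < width
            and rows[0][i] != "-"
            and all(row[i] == rows[0][i] for row in rows)
        )
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i, rows[0][start:i]))
            start = None
    return runs


# Toy, hypothetical aligned sequences for three species.
toy_alignment = [
    "TTACGCGTAAGG",
    "GAACGCGTATCC",
    "CCACGCGTAGAA",
]
hits = conserved_windows(toy_alignment)
```

In this toy input the flanking columns differ between species while a seven-column core is shared, so the scan reports a single window. A realistic analysis would also need to tolerate mismatches and test whether the conservation exceeds what chance alone would produce.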
In this issue's second Hot Paper,2 researchers at Washington University in St. Louis, led by Mark Johnston, took a somewhat similar approach using existing software. They aligned and characterized the sequences of six
Their conclusions echoed those of the MIT group. They predicted that 515 annotated genes were incorrectly identified, reducing the number of genes for
"When you include two species that are much more distantly related, most of the junk then goes away," says Johnston. "But it's more challenging because you can't use off-the-shelf programs because the sequences are too distantly related to align. You need sequences of multiple species of the right evolutionary distance; 70–80% in the noncoding sequences," he says. In any pairwise alignment, if the sequences are too similar, the motifs are hidden within sequence identity, but if identity falls much below 70%, the sequences are too hard to align. He explains that 70–80% sequence identity is ideal, but requires multiple genomes nonetheless. "We need more sequence data of more species – that's the take-home message," says Johnston.
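The 70–80% figure Johnston describes can be made concrete with a short Python sketch that computes percent identity for two already-aligned sequences and checks whether the result falls in that window. This is an illustration under stated assumptions, not the software the Washington University group used: the function names, the choice to ignore gapped columns, and the toy sequences are all hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of matching bases over columns where neither sequence has a gap.

    Both inputs must be rows of the same pairwise alignment, with '-' for gaps.
    Ignoring gapped columns is one common convention; others count them.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def in_useful_range(identity, low=70.0, high=80.0):
    """True if identity lies in the 70-80% window quoted in the article."""
    return low <= identity <= high


# Toy, hypothetical noncoding sequences: 8 of 10 columns match.
identity = percent_identity("ACGTACGTAC", "ACGTTCGAAC")
useful = in_useful_range(identity)
```

Here the two toy sequences differ at two of ten positions, giving 80% identity, the upper edge of the range. Pairs scoring near 100% would leave motifs indistinguishable from their surroundings, and pairs well below 70% would be hard to align reliably in the first place.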
Courtesy of Paul Clifton and Mark Johnston
Alignments including some
Genome scientist Eric Green, director of the intramural sequencing center at the National Human Genome Research Institute, Bethesda, Md., calls the papers an elegant demonstration of what comparative genomics can do. "These papers start to point to how we can use comparative sequence analysis for functionally characterizing mammalian genomes," notes Green.
David Kaback, of the Public Health Research Institute at the International Center for Public Health, Newark, NJ, says the approaches set a new standard for comparing sequence to define functional elements. "It's not always going to be right, but 90% of the time you probably will be able to mine something out of these synteny studies."
"It's clear now that this is how we're going to decipher the regulatory elements of the genome," says Johnston. His hope is that these papers will encourage others to write algorithms to address increasingly complex questions and alignments. His research interests, he notes, are more focused on experimental confirmation of their findings.
MOVING ON AND UP
Kellis recalls that many researchers had said their approach would only work in yeast and wouldn't be applicable to the genomes of higher organisms. In March, however, Kellis and colleagues published an alignment of the mouse, rat, dog, and human genomes, generating a catalog of common regulatory motifs in both promoters and 3' untranslated regions.4
"Comparative sequencing and comparative genomics is a very hot area now," says Salzberg. TIGR, he says, is currently focusing on aligning and annotating three genomes for
Throughout, these comparisons offer a window into the ways in which nature molds a genome over time. "Either [sequences] are beautifully conserved or they're chock-a-block full of frameshifts. There's barely a middle ground," says Lander. Ultimately, selection will serve as the guide to understanding. "How much more could evolution teach us about genomes? The answer is: a tremendous amount."