Comparative Genomics on the Rise

Simple, fast-growing, and sexually reproducing, yeast have been a stalwart model for generations of geneticists.

Nicole Johnston (njohnston@the-scientist.com)
Jun 5, 2005
REJECTED:

© 2003 Nature Publishing Group

Aligned nucleotides of the spurious ORF, YDR102C, are shown as stacked squares for the four species compared (S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus, respectively). Green represents a conserved position, yellow otherwise. In addition to the alignment gaps (white), and the abundant frame-shifting insertions (red), numerous in-frame stop codons are observed in the other three species. (From M. Kellis et al., Nature 423:241–54, 2003.)

Simple, fast-growing, and sexually reproducing, yeast have been a stalwart model for generations of geneticists. The first eukaryote sequenced nearly a decade ago, and amenable to high-throughput techniques, they have also led the charge in genomics study. Two papers published in 2003 affirmed a new claim in computational biology, marking the first time multiple eukaryotic genomes were aligned completely. Doing so set a new standard in comparative sequence analysis, with inevitable implications for annotating the human genome and peering into evolutionary history.

This issue's Hot Papers showcased the enormous power afforded by comparative sequence analysis. A group from genome-sequencing centers in Boston developed new computational techniques to deal with the task of aligning four whole genomes in search of hidden regulatory elements.1 Another group from the sequencing center at Washington University in St. Louis applied conventional computational techniques to a larger sample of genomes.2 While the concept isn't new, the approaches used in these papers elegantly show how much high-precision annotation can be achieved with comparative genomics. These papers "are early indicators of how powerful these methods are," says Maynard Olson, genome scientist at the University of Washington, Seattle.

AWESOME POWER

In the first paper,1 Manolis Kellis and Eric Lander, along with colleagues at the Whitehead Institute, MIT, and Harvard University in Boston, sought to identify regulatory motifs – binding sites for transcriptional regulators – among four species of Saccharomyces, and characterize their evolution. To do a comparative analysis against the high-quality, already annotated Saccharomyces cerevisiae genome, they sequenced the genomes of the sensu stricto ("strict sense") related species Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus – chosen for their high degree of relatedness. The resulting genome alignment enabled them to define and characterize regions of evolutionary change.

Kellis, a computational biologist, then devised a method for accurately identifying both genes and regulatory motifs. The team identified 72 regulatory elements including several new motifs. Their results revealed the yeast genome to contain approximately 500 fewer genes than originally thought – paring down the count by roughly 15%.

"Motif conservation at the genome-wide level had not been asked before; therefore, we were unable to use available programs," says Kellis. "I wrote everything from scratch; I had to develop programs to figure out how to align the genomes." Doing so revealed astonishing detail, recalls Kellis. "To me, the motifs just stood out. The level of refinement and resolution that we had was just unprecedented," he says.

Data derived from the Science Watch/Hot Papers database and the Web of Science (Thomson Scientific, Philadelphia) show that Hot Papers are cited 50 to 100 times more often than the average paper of the same type and age.

"Sequencing and comparison of yeast species to identify genes and regulatory elements," Kellis M, Nature , 2003 Vol 423, 241-54 (Cited in 241 papers, Hist Cite Analysis)

"Finding functional features in Saccharomyces genomes by phylogenetic footprinting," P Cliften Science 2003, 301:71-6 (Cited in 147 papers, Hist Cite Analysis)

Computational structures had to look at genomes as complete entities rather than comparing sequences as fragments. In addition to being able to more accurately identify genes, "we were able to define control elements of the dynamic cell and how genomes are changing across species," explains Kellis.

"It is the short sequence that is the main target of this analysis," says Olson. "The real push has been in regulatory sequences. There are many fewer constraints and they are much harder to recognize. They don't have a [particular] genetic code as a guide; often they're very short, not always perfectly conserved, and their position doesn't matter much. All these things make them harder to recognize."

Regulatory sequences commonly run six to 10 nucleotides; longer regulatory sequences tend to contain a collection of these short sequences. "In the last three to four years, there has been a real explosion in sequences of organisms related to genomes already sequenced," says Steve Salzberg, from The Institute for Genomic Research (TIGR), who wrote the accompanying commentary.3
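To make the scale concrete, here is a minimal Python sketch of the simplest possible conservation scan: it reports every gap-free window of k nucleotides that is identical in all rows of a multiple alignment. It is an illustration only, not the method used in either Hot Paper. The function name `conserved_kmers` and the four toy sequences are invented for this example, and real motif discovery has to tolerate imperfect conservation, which this sketch does not.

```python
from collections import Counter


def conserved_kmers(alignment: list[str], k: int = 6) -> Counter:
    """Count k-mers that are gap-free and identical in every aligned row.

    `alignment` is a list of equal-length strings, one per species,
    with '-' marking alignment gaps (an assumed, simplified input format).
    """
    if not alignment:
        return Counter()
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have equal length")

    rows = [row.upper() for row in alignment]
    counts: Counter = Counter()
    for start in range(length - k + 1):
        window = rows[0][start:start + k]
        if "-" in window:
            continue  # skip windows that overlap a gap in the first row
        # Identical to a gap-free window implies the other rows are gap-free too.
        if all(row[start:start + k] == window for row in rows[1:]):
            counts[window] += 1
    return counts


# Hypothetical aligned intergenic fragments from four species.
toy_alignment = [
    "TTACGCGTAAGC",
    "CTACGCGTATGC",
    "GAACGCGTCAGC",
    "TCACGCGTGAGC",
]

if __name__ == "__main__":
    print(conserved_kmers(toy_alignment, k=6))
```

In the toy data only one six-nucleotide window survives in all four rows, which hints at why closely related genomes, where almost every window survives, are less informative for this kind of scan.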

What makes their work stand out, according to Salzberg, is that it was a different question from anything anyone had asked before, and it was the first time someone had gone and sequenced three genomes at once and aligned them to a fourth. "You can see the nonprotein coding regions; they're conserved in a way that makes it clear that it's not by chance and that there is some evolutionary constraint on that sequence," says Salzberg. "Experimentalists can learn a tremendous amount by studying evolution's laboratory notes," Lander adds.

In this issue's second Hot Paper,2 researchers at Washington University in St. Louis, led by Mark Johnston, took a somewhat similar approach using existing software. They aligned and characterized the sequences of six Saccharomyces species using the popular programs ClustalW and BlastX, searching for phylogenetic footprints and regulatory sequence motifs to identify potentially functional sequences. Their analysis included the published S. cerevisiae genome and five related species, which they sequenced. The three sensu stricto species – Saccharomyces kudriavzevii, S. mikatae, and S. bayanus – were included to enable genome alignment with the available software, which cannot align distantly related genomes. They added to the comparison two more distantly related species – Saccharomyces castellii and Saccharomyces kluyveri – with 33.9% and 54.5% identity, respectively. Including these two more distantly related genomes permitted them to determine the utility of sequences of varying relatedness for such analysis. Chromosomal rearrangements among the more distantly related species revealed short stretches of sequence syntenic with S. cerevisiae. The distant genomes also sharpened the search for conserved motifs, which would otherwise be obscured among highly related genomes.

Their conclusions echoed those of the MIT group. They predicted that 515 annotated genes were incorrectly identified, reducing the number of genes for S. cerevisiae to 5,773. Furthermore, they predicted 43 genes that went unnoticed in the original annotation.
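The figure caption at the top of this article names two signals that mark an annotated ORF as suspect: frame-shifting insertions and in-frame stop codons in the related species. A rough Python sketch of such a check follows. It is not either group's actual test; the function name `frame_disruptions`, the input format (two equal-length aligned strings with '-' for gaps), and the toy sequences are all assumptions made for illustration.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_disruptions(ref_aligned: str, other_aligned: str) -> dict:
    """Count simple frame-breaking signals in a pairwise ORF alignment.

    frameshift_indels: gap runs (in either row) whose length is not a
    multiple of three, i.e. indels that shift the reading frame.
    internal_stops: stop codons in the other species' ungapped sequence,
    read from its first base, ignoring the final complete codon.
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have equal length")

    frameshift_indels = sum(
        1
        for row in (ref_aligned, other_aligned)
        for run in re.finditer(r"-+", row)
        if len(run.group()) % 3 != 0
    )

    seq = other_aligned.replace("-", "").upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    internal_stops = sum(1 for codon in codons[:-1] if codon in STOP_CODONS)

    return {"frameshift_indels": frameshift_indels,
            "internal_stops": internal_stops}


if __name__ == "__main__":
    reference = "ATGGCTGAAAAATAA"   # hypothetical annotated ORF
    relative = "ATGGCTTAAAA-TAA"    # hypothetical ortholog with damage
    print(frame_disruptions(reference, relative))
```

A real gene is expected to score zero on both counts across related species; an annotated ORF that accumulates such disruptions in its relatives is a candidate for rejection.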

IDEAL RELATIONS

"When you include two species that are much more distantly related, most of the junk then goes away," says Johnston. "But it's more challenging because you can't use off-the-shelf programs because the sequences are too distantly related to align. You need sequences of multiple species of the right evolutionary distance; 70–80% in the noncoding sequences," he says. In any pair-wise alignment, if they're too similar the motifs are hidden within sequence identity, but if one goes much below 70%, it's too hard to align them. He explains that 70–80% sequence identity is ideal, but requires multiples genomes, nonetheless. "We need more sequence data of more species – that's the take-home message," says Johnston.

SACCHAROMYCES PEDIGREE:

Courtesy of Paul Cliften and Mark Johnston

Alignments including some sensu stricto related species were used to identify conserved non-coding sequences by the group at Washington University, St. Louis. The more distantly related species S. castellii and S. kluyveri were included to help in identifying possible functional domains in protein sequence.

Genome scientist Eric Green, director of the intramural sequencing center at the National Human Genome Research Institute, Bethesda, Md., calls the papers an elegant demonstration of what comparative genomics can do. "These papers start to point to how we can use comparative sequence analysis for functionally characterizing mammalian genomes," notes Green.

David Kaback, of the Public Health Research Institute at the International Center for Public Health, Newark, NJ, says the approaches set a new standard for comparing sequence to define functional elements. "It's not always going to be right, but 90% of the time you probably will be able to mine something out of these synteny studies."

"It's clear now that this is how we're going to decipher the regulatory elements of the genome," says Johnston. His hope is that these papers will encourage others to write algorithms to address increasingly complex questions and alignments. His research interests, he notes, are more focused on experimental confirmation of their findings.

MOVING ON AND UP

Kellis recalls that many researchers had said their approach would only work in yeast and wouldn't be applicable to the genomes of higher organisms. In March, however, Kellis and colleagues published on the alignment of mouse, rat, dog, and human genomes generating a catalog of common regulatory motifs in both promoters and 3' untranslated regions.4

"Comparative sequencing and comparative genomics is a very hot area now," says Salzberg. TIGR, he says, is currently focusing on aligning and annotating three genomes for Aspergillus, a fungus that can infect the immunocompromised. Drosophila alignments are also in the works, as is alignment of multiple Escherichia coli genomes, although he says they don't anticipate finding many dramatic changes there. The National Institutes of Health is currently funding the sequencing of approximately a dozen mammalian genomes, which should offer more fodder for comparison with humans.

Throughout, these comparisons offer a window into the ways in which nature molds a genome over time. "Either [sequences] are beautifully conserved or they're chock-a-block full of frameshifts. There's barely a middle ground," says Lander. Ultimately, selection will serve as the guide to understanding. "How much more could evolution teach us about genomes? The answer is: a tremendous amount."