Gene Finding with Hidden Markov Models

made history in 1995 when it became the first free-living organism to have its genome completely sequenced.

By | March 28, 2005


Haemophilus influenzae made history in 1995 when it became the first free-living organism to have its genome completely sequenced. In the decade since, some 180 or so organisms have followed suit.

For every one of these genomes, the sequence is only the beginning. The challenge for the computational biologists charged with making sense of the data: to find the gene sequences hidden within those strings, billions of bases long, of As, Cs, Gs, and Ts. The genome annotation strategies these computer scientists cum biologists have developed clearly have come a long way. The most recent iteration (version 4.0) of the Drosophila genome annotation, for instance, updated only 25 predictions out of 13,472 protein-coding genes.

But improvements can still be made. "If they were 100% reliable, then they would have been run on the April 2003 complete human sequence and that would've been it. Those would have been your genes," says David Haussler, who directs the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz. Instead, the human gene count has been downsized by nearly one-third, from 35,000 genes to closer to 25,000.

Much of the field's slow progress stems directly from the contradiction at the heart of the field of computational biology: the attempt to wed the logic of math to the messiness of biology. Algorithms can be simple, elegant, and fast, but they may not capture biology, which is often anything but simple.

"I think what people overlooked was the complexity of biology," says Zhirong Bao, coauthor with Sean Eddy of Recon, a software program for identifying repeat sequences. "They start with simple and beautiful graph theories and implement them in software, assuming that all their DNA sequences are going to behave like their perfect nodes in their perfect graphs."

Now as more genomes are sequenced, researchers are looking to bring their computations in line with the underlying biology. They are creating software that incorporates phylogenetics, the descriptions of evolutionary distance, into the field's favorite computational tool, the hidden Markov model (HMM).


Eddy, associate professor at Washington University in St. Louis and coauthor of the textbook, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press), has dubbed HMMs "the Legos of computational sequence analysis."1 HMMs are special instances of graphical models, which were originally developed by computer scientists studying machine learning and speech recognition. In technical parlance, says Eddy, HMMs "describe a probability distribution over an infinite number of sequences." To the uninitiated, they resemble a cross between a flow chart and a doodle. In order to understand conceptually how HMMs work, consider their origin in speech recognition, says Haussler. In that field, a computer is asked, given a speech wave, what are the phonemes (sounds) that it encodes. The wave is the measured signal; the phonemes are the "hidden" signals that give the HMM its name.

"There is a probabilistic relationship between phonemes," Haussler explains. "After a 'th' sound can easily come an 'r' or an 'ah' or several other types of sounds, but not, for example, a 'k' sound. A hidden Markov model for speech incorporates all possible phonemes, and for each phoneme the probability that it's followed by any other phoneme."

Haussler says the HMM also "models the stochastic relationship between each phoneme and the speech wave one might measure for it. In this way it can be used to infer the sequence of phonemes that best fits a given segment of recorded speech." Translating that to molecular biology, he explains, the measured signal is the sequence of nucleotides, while the hidden signal is their function. "Biology is trying to speak a language to us, and the HMM model is helping us to distinguish the phonemes of that language."

Structurally, says Lincoln Stein of Cold Spring Harbor Laboratory, "HMMs work when you have information that moves from left to right, [when] the probability of an event to the right is dependent on previous events to the left." That logical structure ensures that almost all annotation problems can be cast as HMMs, says Eddy. "You've got an observed residue, and there's one or more labels [or 'states'] that you can attach to it." He continues, "You don't know what the states are. So as you move from left to right, you ask, 'what state am I in,' which is 'what label do I attach to this residue now?"' So, just as in speech recognition, there is a measured signal (the residue) and a hidden one (its label or function).

That's not to say HMMs will solve every problem, says Stein. Noncoding RNA genes, for instance, in which function is encoded both in sequence and in secondary structure, present a particular problem. "The problem with folding is that there's a kind of symmetry. You move left to right and then right to left, depending on where you are in the fold," he says.

Taking their cue from computational linguistics, biologists address these and similar problems using stochastic context-free grammars. SCFGs are really "just HMMs extended to a second dimension," says Eddy. "A lot of the concepts are the same in HMMs and SCFGs. SCFG algorithms are just a little nastier to work with, much more memory and compute time needed, which is the big thing limiting us from deploying SCFGs on every cool computational RNA problem we can think of."



Courtesy of Sean Eddy

This diagram shows the Plan7 architecture used in the gene finding software, HMMER. Squares indicate match states (modeling consensus positions in the alignment). Diamonds indicate insert states (modeling insertions relative to consensus) and special random sequence emitting states. Circles indicate delete states (modeling deletions relative to consensus) and special begin/end states. Arrows indicate state transitions.

Molecular biologists use the term generically, yet many classes or implementations of HMM exist, each of which is adapted to particular annotation challenges. Eddy's own work concentrates on identifying structural and catalytic RNAs. His lab uses a subclass of HMMs called profile HMMs, which were presented in 1992 by Haussler and Anders Krogh as an alternative to profiles, the then-popular approach to protein classification.

"A profile is the technical term for the model used to represent a family of related proteins," says Haussler. Essentially, he says, it is "a computer trick" for deciding how to classify a given protein. But profiles turned out to be incomplete models, and Krogh and Haussler's innovation was to attack the same problem using HMMs. They were not the first researchers to apply HMMs to biology, but Krogh and Haussler's work2 on profile HMMs electrified the field. "Their technical report flew through the community," Eddy recalls. Today, profile HMMs drive the popular protein sequence-analysis software, HMMER.

Now Haussler, along with other researchers in the United States and Europe, is again generating excitement in the annotation community. This time the hubbub surrounds so-called PhyloHMMs, implementations that combine HMMs with phylogenetics to drive better gene prediction.3 Among their more popular current incarnations is the "conservation graph" on the UCSC Genome Browser. Haussler's graduate student, Adam Siepel, created the conservation graph.

PhyloHMMs should help with the "sequence weighting" problem that has plagued earlier attempts to use HMMs to extract information from multiple sequences. "If you just make a naïve kind of assumption about these sequences all being independent, you introduce biases into your analyses," explains Siepel.

The original workaround was to assign arbitrary weights. "People came up with these ad hoc ways of adding weights," says Siepel. For instance, if the human genome was being compared to those of the chimp and the mouse, the researcher might assign less weight to the chimp sequence than the mouse, because the two primates are so closely related that it can be difficult to identify which sequences are under evolutionary pressure. Eddy says that these models have so little genuine statistical basis that he calls them "sequence weighting crap."


© 2005 Nature Publishing Group

This "toy" HMM is trying to find the 5' splice site (5) in a sequence containing an exon (E)-intron (I) boundary. (Reprinted with permission from S. Eddy, Nat Biotechnol, 22:1315–6, 2004.)

By including phylogeny in the model, though, the weighting begins to make biological sense, says Eddy. "With sequence-to-sequence phylogenetic correlation, because the species are actually related to each other, the number of parameters goes down. There's a phylogenetic tree that now says how these sequences are related to each other." Still he warns, "The mathematics is a complicated beast, but this is where the field has to go."

Haussler and Siepel4 have used PhyloHMMs to create software they call "ExoniPhy," which detects new, novel sequences that are conserved from genomes by modeling how codons evolve. "We're trying to find genes that are not supported by cDNA evidence, not supported by sequenced mRNAs or ESTs," says Siepel, "Many genes, we believe, are being missed by the experimental techniques that are used to fish out mRNAs. One of the problems is that these genes are very rarely expressed or they're expressed at very low levels, so for example you might have a gene that's turned on in the ninth day of development for a few hours and then afterwards it's turned off, and so it never shows up in these libraries of mRNA."

A phylogenetic approach assumes that a critical gene, even a rarely expressed one, will be highly conserved. "Even if it's hard to find the mRNA, it's going to have an evolutionary signature in the genome, so we're trying to use these evolutionary footprints to help find genes that haven't been able to be found," concludes Siepel. He says his team believes it has detected "at least several hundred genes" that are strongly supported by the evolutionary/cross-species evidence, but they are not supported or only very weakly supported by cDNA evidence. They are now working on validating their predictions in the lab.


Other researchers, including Michael Brent, are trying similar approaches. Brent, a colleague of Eddy's at Washington University, created TWINSCAN, a gene-prediction tool that works by comparing two genomes. His lab is now working on a TWINSCAN update dubbed N-SCAN.5 As the name implies, N-SCAN allows comparisons between unlimited numbers of genomes.

Tests of the program have been encouraging, Brent says. "With N-SCAN, we're now up to more than one-third of the known genes [in the human genome] predicted exactly right. With more compact genomes like worms, we can get more than 60% of the known genomes exactly right."


Courtesy of Adam Siepel

The conservation graph from the University of California, Santa Cruz, human genome browser is based on a PhyloHMM. This plot shows the known gene structure alongside the computer gene predictions and genomic comparisons with other species for the MET proto-oncogene. Boxes represent exons, while lines with chevrons represent introns.

Like ExoniPhy, N-SCAN combines HMM work with phylogenetic concepts, but Brent cites important distinctions. "The ExoniPhy model is based on an improvement to the classical continuous-time Markov chain model of DNA evolution. The key to that is that you have a phylogenetic tree which has branch lengths that represent the amount of evolution that's gone on since the common ancestor, but the pattern of evolution is the same throughout the entire tree."

In other words, the model assumes that the tolerance for substituting a C for a T in a particular position doesn't change throughout evolutionary history. Instead, what changes is how much time, hence opportunity, there has been for such a substitution to actually occur. (Haussler says Siepel is currently writing code that relaxes that assumption.)

By contrast, says Brent, "Our model is more flexible; it doesn't have that assumption. It allows for one branch to develop a separate pattern of evolution at a given site from another branch." Nevertheless, he praises ExoniPhy as "elegant" and says, "It has a higher specificity, a lower false-positive rate on exons than anybody else."

Such subtle distinctions ensure that no one program will be best suited for every application, says Lior Pachter, a mathematician at UC-Berkeley, and author of SLAM,7 which does simultaneous gene finding and sequence alignment. Indeed, the very idea of comparing various programs amuses Pachter. "People like to write programs like SLAM or TWINSCAN and then they compare them to other programs, and always in their comparisons they come out getting better results," he says. "But of course, it's impossible that everybody has the best program."

From his experiences working on the mouse genome, Pachter observes, "There are certain genes that have certain properties that make them easy to predict and then we all agree, but there are a lot of cases where we disagree." He advises biologists, "It's always a good idea to try all the available tools. It's not always that one program is universally better than the other in every case."

Consider phylogenetics-based programs, for instance. Roderic Guigó of the Universitat Pompeu Fabra in Catalonia, Spain, developer of the comparative gene-prediction program called SGP2,6 notes that the phylogenetics data that make PhyloHMM-based programs so powerful are also their Achilles' heel. "While in general more accurate than purely ab initio methods," he writes in an E-mail, "a drawback of the PhyloHMM gene predictors is that they require the sequence of one or more genomes at the appropriate phylogenetic distance: C. elegans and C. brigasse, human and mouse, Fugu and Tetraodon, etc. [On occasion], such a second sequence does not exist."

One thing is certain: Even when these computer tools are finally perfected, biology will still be done at the bench, not at the keyboard. "Ultimately all computer predictions must be tested in the laboratory," says Haussler, "The more you predict, the more wet lab validation there is to be done." While lay people may find it ego-deflating that the gene set keeps shrinking, experimental biologists may find it something of a relief.

Popular Now

  1. Publishers’ Legal Action Advances Against Sci-Hub
  2. Metabolomics Data Under Scrutiny
    Daily News Metabolomics Data Under Scrutiny

    Out of 25,000 features originally detected by metabolic profiling of E. coli, fewer than 1,000 represent unique metabolites, a study finds.

  3. How Microbes May Influence Our Behavior
  4. Decoding the Tripping Brain