Following Phylogenetic Footprints


Jeremy Peirce

THE POWER OF PHYLOGENY: Studies of the β-globin gene promoter illustrate the power of phylogenetic footprinting, and the importance of species choice in that analysis. (a) Transcription factor analysis of the human promoter without footprinting reveals numerous predictions, most of which are biologically irrelevant. (b) Comparison with the chicken promoter fails to detect conserved sites, but comparison with the mouse promoter does (c), including a documented GATA-binding site (boxed). (d) Comparison with the cow promoter identifies still more conserved sites, but comparison with the macaque monkey promoter (e) is little better than no filtering at all. (Reprinted with permission, B. Lenhard et al., J Biol, 2:13, 2003.)

Scientists know that the regulatory elements that guide and control gene expression, for the most part, lie not within coding sequences but outside and between them. Now researchers are taking their search for these sequences genome-wide. And with hundreds of completed genomes...


Phylogenetic footprinting involves aligning orthologous sequences (from equivalent genes in different species) to find noncoding regions, each containing one or more transcription factor binding sites (TFBS), that have withstood the ravages of evolutionary time. Because TFBS are short and often degenerate, they are difficult to identify directly. Instead, longer conserved regions are identified and examined. The working assumption at this resolution is that functional elements should reside in conserved regions.
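As a rough illustration of the conserved-region step, the following Python sketch scans a pairwise alignment with a sliding window and reports stretches whose identity meets a threshold. The function name, window size, cutoff, and toy sequences are assumptions made for illustration and are not taken from any tool named in this article; a real analysis would start from a proper alignment of the orthologous sequences.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return (start, end) spans of alignment columns where the sliding-window
    identity between two pre-aligned sequences is at least min_identity."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # 1 where both rows hold the same base; gap characters never count as matches.
    matches = [int(a == b and a != "-") for a, b in zip(seq_a.upper(), seq_b.upper())]
    spans = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            # Merge with the previous span when the windows overlap or touch.
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


# Toy, made-up sequences: the first ten columns match, the last ten do not.
human = "AAAAAAAAAA" + "CCCCCCCCCC"
mouse = "AAAAAAAAAA" + "GGGGGGGGGG"
print(conserved_windows(human, mouse, window=5, min_identity=1.0))
```

The spans returned this way are only candidates. The short, degenerate binding sites themselves would still have to be sought inside them.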

This assumption may not apply equally to all systems, however. Using phylogenetic footprinting to examine the well-studied regulatory regions of fly early-patterning genes, Eric Siggia of Rockefeller University in New York found that "simply filtering by interspecies conservation will give an incomplete account of experimentally known binding sites." Because only slightly more known binding sites fell within conserved regions than would be expected by chance, Siggia concludes that the effectiveness of phylogenetic footprinting at comprehensively identifying regulatory sequences "may depend very much on the system, and it remains to be shown that it gets all the regulation for a gene."1


A wide variety of software tools are related to phylogenetic footprinting, and a number of approaches can integrate other forms of data to improve predictions. Martin Tompa of the University of Washington, Seattle, has developed software known as FootPrinter, which uses available sequences, the phylogenetic tree relating them, and an algorithm for de novo motif discovery to search directly for short, conserved motifs.

ConSite, developed by Wyeth Wasserman's group at the University of British Columbia, is a Web-based program that identifies conserved regions and uses known binding-site characteristics to identify active sites. Another tool from his group, oPOSSUM, helps integrate footprinting with microarray data. Microarrays can help to identify coexpressed, and therefore possibly coregulated, genes, Wasserman explains, and oPOSSUM helps to identify the types of binding sites that are overrepresented among these genes.
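One common way to ask whether a binding site is overrepresented in a set of coexpressed genes is a hypergeometric tail probability. The Python sketch below shows that calculation only as an illustration; the article does not describe the statistics oPOSSUM actually uses, and the function name and example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_set, set_size, hits_in_background, background_size):
    """Probability of seeing at least hits_in_set site-carrying genes when
    set_size genes are drawn without replacement from a background in which
    hits_in_background of background_size genes carry the site."""
    total = comb(background_size, set_size)
    upper = min(set_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / total


# Hypothetical counts: 8 of 20 coexpressed genes carry a conserved site that
# appears near 500 of 10,000 background genes.
print(enrichment_p_value(8, 20, 500, 10_000))
```

A small probability would suggest that the site occurs among the coexpressed genes more often than chance alone predicts, which is the kind of signal that flags a candidate shared regulator.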

Choosing which genomes to compare is a major consideration in designing phylogenetic footprinting experiments. More evolutionarily distant species share less sequence simply because it has not yet had time to diverge, so comparisons between them have more power to detect genuine conservation. "If you're looking for a pattern that has been well conserved between all or many of the organisms, you would be more surprised if two distant species shared the pattern than if two closely related species shared the pattern," says Tompa.
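Tompa's point can be made concrete with a simple substitution model. The Python sketch below uses the Jukes-Cantor model, chosen here only for illustration and not mentioned by the researchers quoted, to estimate how often a short motif would remain identical in two species purely by chance. The distances in the example are hypothetical.

```python
from math import exp


def chance_identity(distance, motif_length):
    """Probability that motif_length neutrally evolving sites are all identical
    in two species separated by `distance` substitutions per site, under the
    Jukes-Cantor (JC69) model."""
    per_site = 0.25 + 0.75 * exp(-4.0 * distance / 3.0)
    return per_site ** motif_length


# Hypothetical distances, in substitutions per site, for an 8-base motif.
for label, d in [("closely related pair", 0.05), ("distant pair", 0.5)]:
    print(label, chance_identity(d, 8))
```

The smaller the chance probability, the more surprising a shared motif is, which is why a match between distant species carries more weight than one between close relatives.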

A balance between shared biology and evolutionary distance is necessary when choosing species for comparison. "If you are looking for conserved regions, you have to be confident that the organisms have a shared regulation," says Wasserman. "If you're doing a chicken-human comparison of an organ that's just not present in the chicken, even if you can find the right ortholog it may not be a great thing to do."

But, Wasserman continues, "Once you're sure about the orthologs and regulation, you want to maximize the distance that will allow you to get a good alignment. Right now that's a by-eye problem. I suspect in the not-so-distant future that will be much more quantitative, and you will throw in all the genomes." He concludes, "The bioinformatics has to catch up with the number of genomes."

So is there an optimal set of species to compare? Evidently not. According to Wasserman, sequence diversity also varies within genomes. "Genes are evolving at different rates, so there's no global statement that a pair of genomes are appropriate or inappropriate for comparison," he says. "You have to go by local characteristics."

Dario Boffelli, staff scientist at the Lawrence Berkeley National Laboratory in California, mentions another concern. "Conserved expression can be achieved in a number of different ways. After the split from their last common ancestor, each lineage may accumulate different changes and compensatory changes, but [the regulatory regions] may do the same thing. So human and mouse can be doing the same thing [i.e., using similar regulatory strategies], but with sequences you can't identify by conservation."


Comparing more closely related genomes could solve the problem. Boffelli and colleagues have developed a technique2,3 that partly overcomes the limitations accompanying closer comparisons. They used their method, phylogenetic shadowing, to compare multiple primates. "The basic ideas underlying phylogenetic shadowing and phylogenetic footprinting are very similar," Boffelli explains. "The idea specific to shadowing is the focus on species with a close phylogenetic relationship and on using a more sophisticated model [of conservation]." Shadowing assumes that important sequences will be strongly conserved among closely related species and eliminates less well-conserved regions.

Total tree length, a measure of the evolutionary distance between compared genomes and indirectly of experimental power, is approximately "the same between human and mouse and between human and primates with seven primates," Boffelli notes. The potential power of the technique is limited, however, by the overall diversity of primates. "If you sequence more you only see a very small increase," Boffelli explains. "Five to seven [sequences] capture about 85% of the variation."
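Total tree length is simply the sum of all branch lengths in the phylogenetic tree relating the compared species. The Python sketch below shows the arithmetic on a toy tree; the tuple layout, species labels, and branch lengths are made up for illustration and are not Boffelli's data.

```python
def total_tree_length(node):
    """Sum every branch length in a tree given as (name, branch_length, children)."""
    _name, branch_length, children = node
    return branch_length + sum(total_tree_length(child) for child in children)


# Toy tree with invented branch lengths (substitutions per site).
toy_tree = (
    "root", 0.0, [
        ("species_A", 0.10, []),
        ("ancestor", 0.05, [
            ("species_B", 0.20, []),
            ("species_C", 0.30, []),
        ]),
    ],
)
print(total_tree_length(toy_tree))
```

Adding a species contributes only the new branch it brings to the tree, so a close relative of a species already sampled adds little length. That is consistent with the small gains Boffelli describes from sequencing additional primates.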

While that number of sequences is fairly easy to acquire for individual genes, full genomes are another story. The chimpanzee genome is available and the macaque genome is expected soon. But according to Boffelli, "the chimp is too close to human to be of any use for shadowing." He adds, "What we are really missing are new-world monkey genomes. Since these are the most distant from human, they contribute the most to the analysis."

In addition to facilitating research on primate-specific genes, working with closely related species simplifies modeling. "Using closely related species means the [phylogenetic] trees have much higher reliability, and consequently all the mutation rate estimations are much more accurate," says Boffelli. This allows precise evaluation of the likelihood that a particular region has been conserved by evolutionary pressure, a difficult task in more distant comparisons. Boffelli contends that conserved functions are also more likely to be detected as conserved sequence, because closely related species have had less opportunity for divergence and compensatory change.


Even with an extensive collection of genomes, says Boffelli, "you can only compare the conserved biology. You often define the types of things you can discover based on the kind of organisms you study." For biology conserved between humans and other mammals, plenty of sequence diversity can be tapped. Stanford geneticist Gregory Cooper and colleagues4 estimate that approximately four times the diversity available in the human, mouse, and rat genomes – a goal reachable using fewer than 20 mammalian species, by Boffelli's estimation – would give phylogenetic footprinting experiments single-nucleotide resolution.

Elliott Margulies, a National Human Genome Research Institute (NHGRI) research fellow, and other researchers5 agree that more genomes will improve results. "We have done analyses that show you still make incremental gains in specificity for detecting sequences under purifying selection out to 16 and 17 species," Margulies notes.


Recently a group led by David Haussler, a Howard Hughes Medical Institute investigator at the University of California, Santa Cruz, used phylogenetic footprinting to identify 481 long, highly conserved regions in the human genome.6 Measuring 200 to 800 base pairs, these regions are perfectly conserved among humans, mice, and rats.6 Wasserman calls the observation "tremendously interesting."
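The defining criterion, perfect identity across several species over a long stretch, can be sketched in a few lines of Python. This is an illustrative scan over a multiple alignment, not the Haussler group's pipeline; the function name and toy sequences are assumptions, and the example uses a length cutoff far smaller than the roughly 200-base-pair minimum reported.

```python
def ultraconserved_runs(aligned, min_length=200):
    """Return (start, end) column spans where every aligned sequence carries the
    same ungapped base for at least min_length consecutive columns."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need one or more sequences aligned to the same length")
    runs, start = [], None
    n = len(aligned[0])
    for i in range(n + 1):  # one extra step closes a run that reaches the end
        same = (
            i < n
            and aligned[0][i] != "-"
            and all(s[i] == aligned[0][i] for s in aligned)
        )
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


# Toy aligned sequences (made up); the first eight columns are identical.
human = "ACGTACGTTT"
mouse = "ACGTACGTAT"
rat = "ACGTACGTTA"
print(ultraconserved_runs([human, mouse, rat], min_length=5))
```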

These long stretches do not fit conventional biological paradigms of conservation. More than half of the ultraconserved regions are not associated with genes at all; the others often overlap both coding and noncoding regions, according to Haussler. Prior wisdom held that noncoding conserved regions are generally much shorter, because protein-binding sites tend to comprise only a few base pairs. "That's the mystery," Haussler says. "Why would they be conserved at such a high level over such a long [evolutionary] distance?" Indeed, 29 regions were entirely conserved between human and chicken, which are thought to have last shared a common ancestor an estimated 300 million years ago.

However, as postdoctoral fellow and first author Gill Bejerano notes, only the cores of some ultraconserved regions were present in fish, and Haussler's group "[was] not able to show the existence of ultraconserved regions for anything simpler than fish, including fly, sea squirt, and Caenorhabditis elegans." This, say both Haussler and Bejerano, suggests that the identified regions are particular to vertebrates, though other lineages may have their own analogous sequences. Bejerano says the Haussler group is collaborating with others to determine if any of these sequences function as distal enhancers of genes important in development.


Recently NHGRI launched an initiative, the Encyclopedia of DNA Elements (ENCODE), to identify all functional elements in 1% (30 million base pairs) of the human genome.7 Not surprisingly, phylogenetic footprinting plays a large role in the work. According to program director Peter Good, the consortium is designed such that "all investigators agree to work on the entire ENCODE region, rather than cherry-picking regions ... and agree to the rapid data-sharing requirements." The focus on a common fragment of the genome is intended to help facilitate exhaustive study, collaboration, and resource development, including extensive sequencing efforts.

ENCODE will take advantage of both available sequence and sequence generated specifically for the project. "For the ENCODE regions we will have the best sequence available, so comparative genomics will be important," Good says. According to Margulies, "Our group plans to sequence [the ENCODE regions in] roughly four species per year to a comparative-grade level of finishing." In addition, the NHGRI recently announced a sequencing initiative that will include nine mammalian and nine nonmammalian genomes, largely selected based on their usefulness for comparative genomics.8 The mammalian group will include animals ranging from the rabbit to the African savannah elephant.

Good has high hopes. "When ENCODE is complete, the consortium members will have identified a lot of interesting biology and will have learned how to identify many interesting functional elements. What NHGRI is looking for is a path forward for how we're going to do this for the other 99% [of the genome]."

That path likely will continue to involve phylogenetic footprinting and related techniques, and given the speed and increasingly modest cost of sequencing new genomes, that role will become only stronger over time.5 As Margulies observes, "With every new genome that becomes available, we are making unprecedented gains in decoding the functions of vertebrate genomes."


Ultimately, however, any factor identified by a computer must be verified at the lab bench. Indeed, the ENCODE initiative sets aside funds for large-scale and exploratory applications of experimental techniques for rapidly identifying functional elements. According to Wasserman, these wet-lab techniques are important for biological validation and for gathering tissue-specific and temporal information that phylogenetic footprinting cannot distinguish. "The ultimate goal is to have a set of known regulatory regions that direct expression to the cell type of interest," Wasserman says.

One popular approach to gaining such validation is called ChIP-chip (or ChIP-on-chip). Blending chromatin immunoprecipitation (ChIP) and genomic microarrays (or "DNA chips"), the technique, says Xiang-Dong Fu of the University of California, San Diego, "provides a snapshot of the protein-binding status of DNA at a particular moment, potentially for the entire genome." This snapshot, he adds, captures proteins related to chromatin structure as well as those engaged in regulatory tasks.

With the ChIP-chip method, proteins are reversibly crosslinked to the DNA they bind, the DNA is fragmented, and a factor-specific antibody is added to precipitate the protein-DNA complexes. After precipitation the crosslinks are reversed, and the released DNA is fluorescently labeled and applied to a genomic microarray to map its position.

The resulting data depend on the antibody used. Antibodies to "sites of histone modifications can give us a good idea of the locations of many elements," Dunham explains. In contrast, "using antibodies for specific transcription factors can mark a limited set of elements but with much more information about function."

"ChIP-chip and other methods put some experimental data on top of the bioinformatic analyses," says Dunham. "Essentially it would be nice to identify regions that are conserved between species and [that] can be shown to have a function such as binding a particular set of proteins in vivo." Experimental results also provide feedback for improving computational algorithms. "In the end," Good agrees, "you have to do wet-lab validation of any computational approach. The wet-lab result validates the computational approach and is also important for improving [it]."

Jeremy L. Peirce
