TREES FROM THE FOREST:
Software creates large samples of phylogenetic trees. How often a particular group of sequences occurs in the sample can indicate the amount of support for that group. The bootstrapping approach (A) involves generating pseudo-replicate data sets by re-sampling – with replacement – the sites in the original data matrix. Trees are produced from each of the pseudo-replicate data sets by performing tree searches or using a tree-building algorithm. The proportion of trees that contains a group is often interpreted as an assessment of repeatability – the probability that the grouping would be recovered with another data set. The Markov chain Monte Carlo methodology (B) does not alter the data. Instead a chain of trees is produced by starting from an initial tree, proposing random perturbations, and either accepting or rejecting these proposals according to specific rules (which take into account the likelihood of the proposed tree)....
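The resampling step in (A) can be sketched in a few lines of Python. This is a minimal illustration, not the code of any package named in this article: the function names, the toy alignment, and the representation of a tree as a set of clades are all assumptions made for the example, and the tree-building step itself is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by resampling sites (columns) with replacement.

    `alignment` maps a taxon name to its aligned sequence; all sequences
    are assumed to have the same length.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; the same column may be drawn repeatedly.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}

def clade_support(trees, clade):
    """Fraction of trees containing `clade`.

    Each tree is represented (an assumption of this sketch) as a set of
    frozensets, one per group of taxa that the tree recovers.
    """
    clade = frozenset(clade)
    return sum(clade in tree for tree in trees) / len(trees)

# Hypothetical toy data: three taxa, four aligned sites.
toy_alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
replicate = bootstrap_alignment(toy_alignment, random.Random(0))

# Hypothetical trees, as if built from three pseudo-replicates.
toy_trees = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
]
support_ab = clade_support(toy_trees, {"A", "B"})  # 2 of 3 trees
```

In a real analysis each pseudo-replicate would be passed to a tree search, and the support value would be computed over hundreds or thousands of resulting trees.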
Data derived from the Science Watch/Hot Papers database and the Web of Science (Thomson Scientific) show that Hot Papers are cited 50 to 100 times more often than the average paper of the same type and age.
TREES TAKING OVER
In the past, a relatively small group of scientists, mostly evolutionary biologists, was interested in organismal relationships. Now phylogenetic trees are used across the board. "It's become a standard tool in many disciplines," notes Fredrik Ronquist, coauthor of MrBayes.
"If you have a bunch of sequences, one of the standard things to do with them is make a tree," concurs Felsenstein. Many times the objective is not the tree itself, but to use the tree to understand the evolutionary relationship of particular characteristics across species. Scientists may use the tree to correlate a point at which a genetic change was fixed with phenotypic adaptations that occurred simultaneously. This, in turn, may help to establish the molecular basis for that phenotype.
Such analyses often examine whether nucleotide substitutions are synonymous (do not alter which amino acid is coded for) or nonsynonymous, and trace the history of gene-duplication events. This can yield information about the selective pressure, for example, that genes or even specific codons are under, or why speciation might have occurred.
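The synonymous/nonsynonymous distinction can be illustrated with a short Python sketch. It assumes the standard genetic code and simply compares the amino acids encoded by two aligned codons; the function name classify_substitution is hypothetical, and the published methods for estimating substitution rates are considerably more involved.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(codon_a, codon_b):
    """Label a codon change as identical, synonymous, or nonsynonymous."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"      # same amino acid, different codon
    return "nonsynonymous"       # the encoded amino acid changes

# CTT and CTC both encode leucine; GAA (glutamate) to GTA (valine) does not.
silent = classify_substitution("CTT", "CTC")
altering = classify_substitution("GAA", "GTA")
```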
The human diaspora has been modeled by reconstructing phylogeny from mitochondrial sequences derived from people of different ethnic origins.3 Phylogenetic modeling also has been used for projects such as predicting which strain of influenza is likely to hit next,4 or determining whether HIV likely came from chimps.5 The "basic purpose is to look at DNA and protein sequences in a comparative context," explains Kumar, now at the Biodesign Institute at Arizona State University.
Eric Schuettpelz, a graduate student at Duke University, Durham, NC, is interested in the relationships among ferns. In addition to allowing him to make "more natural" classifications, which can be useful for assessing biodiversity, his research focuses on the timing of major diversification events. For a recent paper, the group used MrBayes to estimate divergence times for ferns and angiosperms, based on molecular data. They were able to hypothesize, contrary to the prevailing view, that the diversification of ferns was "perhaps an ecological opportunistic response to the diversification of angiosperms, as angiosperms came to dominate terrestrial ecosystems."6
Since he works with protein-coding sequences from chloroplasts, Schuettpelz says, it's relatively straightforward to align the sequences (which lack introns) with their cross-species homologs and create "a matrix, with each row representing a species and each column representing a particular site in a gene sequence."
The data are fed into MrBayes, along with parameters such as the model of DNA substitution to be assumed, and how often trees should be sampled. Schuettpelz usually lets the program go through at least 10^7 iterations, each of which involves one or more changes to the topology of the phylogenetic tree it is querying or to a particular substitution parameter. The program "makes a proposal, and assesses the likelihood with and without, before and after the proposal was made. Depending on how different the likelihoods are, it might accept or reject that proposal, then make another," explains Schuettpelz. The outcome is a series of phylogenetic trees, each assigned a probability. More probable trees will ultimately come up more often in the sampling process.
Researchers will often use several different statistical methods, notes Kumar, so they will know that their results are at least not biased by methodology. And people "often get hammered by reviewers" if they don't do one or another of their favorite methods, adds John Huelsenbeck, a coauthor of MrBayes and now at the University of California, San Diego.
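The propose-then-accept-or-reject loop Schuettpelz describes resembles a Metropolis sampler, which can be sketched in Python under heavy simplification. Everything here is an assumption made for illustration: the "trees" are three labels with made-up log scores, the proposal picks one of the other labels at random, and none of the names come from MrBayes, whose actual proposals, priors, and chain setup are more elaborate.

```python
import math
import random

def metropolis_chain(log_scores, n_steps, rng, sample_every=1):
    """Sample states in proportion to exp(log_score) with Metropolis steps.

    `log_scores` maps each candidate state (here, a tree label) to an
    unnormalized log posterior score.
    """
    states = list(log_scores)
    current = rng.choice(states)
    samples = []
    for step in range(1, n_steps + 1):
        # Symmetric proposal: jump to one of the other states at random.
        proposal = rng.choice([s for s in states if s != current])
        log_ratio = log_scores[proposal] - log_scores[current]
        # Always accept an improvement; accept a worse state with
        # probability equal to the ratio of the two scores.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        if step % sample_every == 0:
            samples.append(current)
    return samples

# Hypothetical scores for three candidate trees.
toy_scores = {"tree1": math.log(0.7),
              "tree2": math.log(0.2),
              "tree3": math.log(0.1)}
samples = metropolis_chain(toy_scores, 20000, random.Random(42))
freq_tree1 = samples.count("tree1") / len(samples)
```

Over a long enough run, the share of samples showing each tree approaches that tree's share of the total score, which is why more probable trees come up more often in the sample.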
Reviewers often look for at least one estimate of likelihood. Yet because of the computational complexity involved in traditional algorithms, these calculations were unwieldy, recalls Nicolai Kandul, a Harvard University graduate student. When he was publishing his work on butterfly speciation, the computationally more efficient MrBayes was essentially the only program doing anything like maximum likelihood that could handle a data set of that size. "MrBayes is not exactly likelihood, but it's close."
A maximum-likelihood analysis, such as those found in MEGA, PHYLIP, and PAUP, will calculate the probability of the data under the assumption that the tree is correct. The programs will generate a single tree that is deemed best: the one under which the input data have the highest likelihood. MrBayes, on the other hand, uses Bayes' theorem to "reverse that conditioning statement," explains Huelsenbeck. It calculates the "posterior probability" that a tree is correct under the assumption that the data are correct, he notes.
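The reversal Huelsenbeck refers to can be written out with Bayes' theorem. The notation below is an assumption of this sketch rather than anything taken from the MrBayes papers: D stands for the sequence data, T_i for the i-th candidate tree, and the sum runs over all candidate trees.

```latex
% Likelihood: probability of the data D given tree T_i  ->  P(D \mid T_i)
% Posterior: probability of tree T_i given the data D   ->  P(T_i \mid D)
P(T_i \mid D) = \frac{P(D \mid T_i)\, P(T_i)}{\sum_{j} P(D \mid T_j)\, P(T_j)}
```

Here P(T_i) is the prior probability placed on each tree. In practice the likelihood term also averages over branch lengths and substitution parameters, and the sum in the denominator covers far too many trees to evaluate directly, which is why the posterior is approximated by sampling.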
"The results of a Bayesian analysis are very straightforward and easy to interpret," Huelsenbeck adds. The same cannot be said for other methods. "There are arguments over what exactly the probabilities we put on clades mean in bootstrapping," perhaps the most widely used phylogenetic method, notes Felsenstein, who introduced the statistical technique to the field in 1985 (in a paper cited nearly 8000 times).7
Ronquist, now at Florida State University, Tallahassee, says that while MrBayes is easier to use than many other programs, it still requires a lot from the user. "It's all done with command lines," notes Schuettpelz. Data are input as a run file, with the results generated as an output file. MrBayes can also produce a simple representation of a tree in ASCII format.
MEGA2, on the other hand, offers a Windows-based graphical user interface. "Convenience of use is the major factor" in its popularity, ventures Austin Hughes, from the University of South Carolina. Kumar concurs. "The whole philosophy behind the software has been to make tools for exploring sequence data easy for the user."
Many programs are available that are capable of doing what MEGA2 does, at least piecemeal. But, Kumar adds, "MEGA is an integrated platform for analysis, data exploration, and result visualization." The latest version, MEGA3, adds to this its own Web browser, allowing researchers to generate or obtain the data inside the program, and a built-in alignment system, which can automatically or manually align any portion of the data.8
The MEGA2 and MrBayes papers are hotter now than they have ever been, each with more than 40% of their total citations made in 2004. But the popularity of phylogenetic modeling has spawned a host of competing software. Felsenstein's Web site lists more than 200 such packages,9 and new programs are being developed all the time.
David Swofford, also at Florida State, plans to replace his PAUP with another package that includes Bayesian methods, he says. This package is to be a module in the ambitious CIPRES project, which ultimately aims to link the variety of phylogenetic utilities, from searching and alignment to statistics and graphics, into a single, seamless operating environment capable of handling hundreds of thousands of taxa simultaneously.10 How heavily these works will be cited will depend, in part, on how they are communicated.