With sequencing reads getting longer and cheaper in the past few years, researchers have begun ambitious efforts to catalog the genomic richness and variation within complex microbial and viral communities. So-called metagenomics studies involve collecting a sample of cells from their environment, breaking them open, chopping their DNA into pieces, and running the fragments on a sequencing machine.
Metagenomic analyses are more computationally demanding than genomic analyses because you’re working with a mix of diverse genomes rather than DNA from a more homogeneous microbial population. And even more than for genomics, one of the biggest challenges for metagenomics is making sense of the resulting data, says evolutionary biologist Jonathan Eisen of the University of California, Davis. Scientists want to understand not only what microorganisms are present in a particular environment and at what levels (not easy, considering that an average of 99 percent of them have never been cultured), but also what their functions are and how they compare with one another. “Sequencing is cheap, but that doesn’t mean you can put a community into a sequencer and make sense of it,” Eisen says.
Scientists have attempted to tame the complexity of metagenome studies by using improved sample preparation methods or analyses, or both. Computational approaches in particular need constant updating in order to keep pace with ever larger and ever-changing data sets.
The Scientist spoke with developers of tools for parsing genomic data from diverse communities of microorganisms. Here are some of the newest strategies and programs for taxonomic, functional, and comparative analyses.
Taxonomic, functional, and comparative analyses
Overview: MEGAN—for MEtaGenome ANalyzer—was initially created in 2007 to identify microbial organisms contaminating a sequencing study of DNA from a woolly mammoth bone. In addition to taxonomic analysis, the most recent version, released last year, rapidly compares multiple data sets and classifies the functions of genes in a metagenome; added features include metadata support and new ways of visualizing data.
Getting started: Before importing data into MEGAN, users should align their reads by comparing them to reference sequences using BLAST or a similar program. Alignment is the most computationally demanding part of the analysis. The resulting data are imported (and converted into a MEGAN file), and users will see that their reads have been assigned to nodes of the NCBI taxonomy. MEGAN incorporates three established analysis databases—KEGG (Kyoto Encyclopedia of Genes and Genomes), SEED, and COG (Clusters of Orthologous Groups of proteins)—which use different strategies for classifying gene function.
Learning curve: Beginners are easily able to run the software after an hour-long tutorial, followed by a three-hour, hands-on demo. The next one is planned during a metagenomics workshop to be held from September 8 to 12 at the Genome Analysis Centre in Norwich, U.K., says developer Daniel Huson, a bioinformatician at the University of Tübingen in Germany.
Look ahead: A new (as yet unpublished) algorithm the group has developed, called DIAMOND, promises to speed the initial alignment step by 16,000-fold. Aligning one million reads with BLAST would normally take about 44 days, Huson says. “This new thing we’ve submitted takes four minutes.” He has already started rewriting MEGAN to help keep pace with DIAMOND.
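The quoted figures can be checked with a few lines of arithmetic. The short sketch below uses only the numbers Huson gives (44 days versus four minutes for one million reads) and shows that their ratio comes out close to the rounded 16,000-fold figure; the variable names are illustrative.

```python
# Back-of-the-envelope check of the speedup quoted in the article.
# Inputs are the two figures from the text; nothing here is measured.
MINUTES_PER_DAY = 24 * 60

blast_minutes = 44 * MINUTES_PER_DAY   # about 44 days for one million reads
diamond_minutes = 4                    # "takes four minutes"

speedup = blast_minutes / diamond_minutes
print(f"BLAST: {blast_minutes} min, DIAMOND: {diamond_minutes} min, "
      f"ratio: {speedup:,.0f}-fold")
```

The ratio works out to 15,840, which rounds to the 16,000-fold speedup cited above.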
Considerations: A related tool, Qiime (pronounced “chime”), performs functions similar to MEGAN’s but allows users to do more sophisticated analyses. As a command line–based tool, however, it’s harder than MEGAN for biologists to learn, says Huson.
Cost: Free for academic use. That said, you’ll need a computer with at least 4 GB of RAM to run it. In addition, note that the tool incorporates an old (2011) version of KEGG, and because KEGG is no longer free, you’ll have to pony up $2,000 if you want the latest version.
Overview: Launched in September 2013, Kraken takes short DNA sequences from metagenomic samples and assigns them to taxa faster than conventionally used programs such as megaBLAST, but with similar accuracy (Genome Biol, doi:10.1186/gb-2014-15-3-r46, 2014). This speed boost comes primarily from a special database that has precomputed the answer to the question of which genomes have a particular k-mer, a short string (in this case, 31 base pairs) of DNA sequence. “That’s the big idea behind Kraken,” says Derrick Wood, who developed the tool as a graduate student in Steven Salzberg’s group at Johns Hopkins University School of Medicine in Baltimore, Maryland. “If you can take the individual k-mers within a read, and figure out which genome they can possibly be contained in very quickly, then you can accelerate classification.”
Depending on what’s in your sample, with Kraken you can classify 70–90 percent of your reads, which is on a par with (but faster than) programs that do classification, such as PhymmBL, Wood says. In contrast, other programs try to speed classification work by tapping into smaller databases, but that means they might classify only 10 percent of your reads.
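To make the precomputed k-mer lookup idea concrete, here is a minimal Python sketch. It is not Kraken's code or its actual algorithm: the toy genomes, the tiny k of 5 (the article cites 31 base pairs), the function names, and the simple majority vote are all assumptions chosen for illustration, and the published tool is considerably more elaborate.

```python
from collections import Counter, defaultdict

K = 5  # toy value for short example strings; the article cites 31 bp

def kmers(seq, k=K):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(genomes, k=K):
    """Precompute which genomes contain each k-mer (done once, up front)."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index

def classify(read, index, k=K):
    """Assign a read to the genome hit by the most of its k-mers.

    Returns None when no k-mer in the read is found in the index.
    Majority voting is a simplification assumed for this sketch.
    """
    votes = Counter()
    for km in kmers(read, k):
        for name in index.get(km, ()):
            votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical reference sequences, invented for the example.
genomes = {
    "genome_A": "ACGTACGTTAGCCGATAGGCTA",
    "genome_B": "TTGACCAGTTTGGACCATGACA",
}
index = build_index(genomes)

read_a = "CGTTAGCCGATA"  # taken from genome_A
read_b = "CAGTTTGGACC"   # taken from genome_B
read_x = "GGGGGGGGGG"    # matches neither genome

for read in (read_a, read_b, read_x):
    print(read, "->", classify(read, index))
```

Because the index is built once, each read costs only a handful of dictionary lookups rather than an alignment, which is the intuition behind the speed boost Wood describes.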
Getting started: The first step is to build a database, or load the small one available on Kraken’s website (MiniKraken DB, which is constructed from complete bacterial, archaeal, and viral genomes in RefSeq). Researchers can add particular genomes as they see fit, but “right now it’s problematic to put every possible genome into the database because the memory requirements would be astronomical,” Wood says. Next, you’ll point Kraken to your file of reads (or assembled fragments) and to your database.
After the classification, you can run a simple text-based summary of results—how many reads at a particular grouping of species, for example—using a separate program called Kraken-report. For a more graphical look at your data, use Krona, a metagenomics data browser.
Learning curve: You don’t need to know exactly how the program works to get a result out of it—and so far, users have been able to run it well, Wood says.
Cost: Free, with source code available on the software repository GitHub (github.com).
Taxonomic, functional, and comparative analyses; data sharing
Overview: Introduced in 2007, MG-RAST is a Web server that allows researchers to obtain comparative and functional analyses of metagenomes. Users can elect to make their results public or can share (privately) with others. The tool includes numerous readouts of data quality, such as DRISEE, which uses artificial duplicate reads to estimate percent sequencing error on a sample, and nucleotide position histograms.
Getting started: Register at MG-RAST’s website and upload your data into the pipeline. Or, you can use the R interface or the application programming interface (API) to write your own code; it’s all open source. “If you’re a bioinformatician, you’ll find [those] very easy. If you’re a biologist you’ll probably need some training,” says developer Folker Meyer of Argonne National Laboratory in Lemont, Illinois.
Costs: Free, if you can bear to wait about a week on average for results. For a more immediate answer, you’ll need to have access to (or pay for) your own computational muscle. (The new release, due out soon, promises a shorter wait time, Meyer says.)
Considerations: Although all the algorithms are open source, performing them yourself would be computationally expensive and time-consuming. Some new users want to have their students perform the quality controls, Meyer says; he advises them to try comparing MG-RAST’s and the students’ results to build trust in the tool.
Viral metagenome comparison; assembled virome analysis
Overview: Metavir is a website that helps researchers annotate and analyze viral metagenomes. Traditional tools for analyzing metagenomes compare the user’s sequences to those in a database; however, because few viruses are sequenced (and those that are sequenced are ones likely to infect humans or bacteria, creating inherent bias), users are able to annotate only 20–30 percent of their viral metagenomic sequences, notes François Enault, a researcher at the Laboratoire Microorganismes : Génome et Environnement in Clermont-Ferrand, France. Metavir is able to directly compare metagenomes to each other rather than to reference genomes. In addition, the new version described in March incorporates more markers of major viral families (BMC Bioinformatics, 15:76, 2014).
Getting started: The developers don’t plan to give in-person demos outside of France, so a guided video tour, available on Metavir’s site, is the best place to begin. Users can upload their sequences and wait about one week for results to come back. The website incorporates tools for classical analysis—for example, comparison with a reference database—along with the ability to directly compare viromes.
Learning curve: It takes no time at all to learn how to submit your data and to navigate the results, says Enault. “What is most difficult is how to interpret the results. What can you really say about your data?” There’s no easy answer for that, he adds.
Considerations: Another Web server, called VIROME (Viral Informatics Resource for Metagenome Exploration) allows users to annotate their viromes more on a functional than on a taxonomic basis. Enault says the two tools are complementary.
Hi-C for metagenomes
Reconstructing individual genomes (assembly)
Overview: The laboratory method Hi-C, which cross-links interacting spots on chromosomes in a genome-wide fashion, was first established in 2010 to help researchers preserve and study the three-dimensional structure and regulation of genomes. Two new studies have independently adapted this method to mixed microbial communities, with the goal of resolving individual genomes from a mix of reads (G3, doi:10.1534/g3.114.011825, 2014; PeerJ, 2:e415, doi:10.7717/peerj.415, 2014).
Authors from each of the two studies needed the ability to analyze the sequences obtained from Hi-C metagenomes. To help develop their tools, each group used a simple “synthetic” microbial community, for which the full sequence of each member was known.
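The underlying intuition is that DNA fragments cross-linked to each other were in the same cell, so assembled pieces joined by many Hi-C read pairs probably belong to the same genome. The Python sketch below illustrates that idea only; it does not reproduce either group's published method. The contig names, the `min_links` threshold, and the choice of simple connected-component grouping are assumptions made for the example.

```python
from collections import Counter

def cluster_contigs(link_pairs, min_links=2):
    """Group contigs connected by at least min_links Hi-C read pairs.

    link_pairs: iterable of (contig, contig) tuples, one per read pair.
    Pairs linking a contig to itself carry no grouping information
    and are ignored. Returns a sorted list of sorted contig groups.
    """
    counts = Counter(tuple(sorted(p)) for p in link_pairs if p[0] != p[1])

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Register every contig, including ones that end up unlinked.
    for a, b in link_pairs:
        find(a)
        find(b)

    # Merge contigs whose link count clears the threshold.
    for (a, b), n in counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical read pairs from a two-member "synthetic" community.
pairs = [
    ("c1", "c2"), ("c1", "c2"),
    ("c2", "c3"), ("c2", "c3"),
    ("c4", "c5"), ("c4", "c5"), ("c4", "c5"),
    ("c3", "c4"),   # a single stray link, below the threshold
    ("c1", "c1"),   # self-link, ignored
]
clusters = cluster_contigs(pairs)
print(clusters)
```

With a community whose member genomes are fully known, as in the synthetic mixes both groups used, groupings like these can be checked directly against the truth.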
Getting started: Each paper describes its method in detail, and both groups have made their source code available via GitHub. Eisen’s graduate student Christopher Beitel has also published a step-by-step guide to the computational analyses on his blog (datascimed.ghost.io). “There should be more than ample information for people to be able to replicate this,” he says.
Learning curve: Hi-C is a complicated protocol, but an experienced microbiologist will have no trouble with it, Beitel says. A detailed video protocol of Hi-C is available in the original paper (J Vis Exp, 39:1869, 2010). The computational aspect requires basic bioinformatics training, but Beitel’s blog post aims to make it doable even for novice users.
Costs: The algorithms are free and open source. The lab method involves use of biotinylated nucleotides, which are expensive, says Beitel. Also, you end up sequencing DNA fragments that have linked with themselves, which means you waste a lot of sequencing power. But you can still recover whole genomes of the most abundant members of the community, and that may be all you need. Eisen’s group is working on modifying Hi-C to capture only non-self ligating DNA fragments, Beitel says.
Considerations: Both groups caution against diving into these methods before thinking about whether they will work for your question. “It’s still early days,” says Jay Shendure, an associate professor of genome sciences at the University of Washington in Seattle who led one of the teams. At this stage, it’s best to try the technique on simple microbial communities.