Benching Bases
How to do heavy computational lifting in genomes and transcriptomes
You've unpacked your next-generation sequencing system and popped in some DNA or RNA. Five days later, you've sequenced 50 million tiny strings of nucleotides. Then what?
Based on their sequences, you have to align all the fragments, called "reads," with the help of a reference genome—a fully assembled sequence from the same species. In the absence of a reference, you're left with assembling the genome based solely on the portions of the reads that overlap with each other. For both alignment and assembly, "computation becomes a big issue," says Steven Salzberg, director of the University of Maryland's Center for Bioinformatics and Computational Biology. "That's a huge amount of data, and in fact even streaming the data off the machine onto other computers causes network bandwidth problems."
That's because most newer technologies generate shorter reads—roughly 25 to 50 nucleotides in length.
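In rough terms, the two computational jobs look like this toy sketch (the reads, reference sequence, and minimum-overlap cutoff below are invented for illustration, not taken from any real platform):

```python
# Toy illustration of alignment vs. de novo assembly (not production code).
reads = ["ACGTAC", "GTACGG", "ACGGTT"]      # hypothetical short reads
reference = "TTACGTACGGTTA"                 # hypothetical reference genome

# 1) Alignment: place each read on a known reference (exact matching only here).
for r in reads:
    print(r, "maps at position", reference.find(r))   # -1 would mean unmapped

# 2) De novo assembly: with no reference, chain reads together by their overlaps.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (>= min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

contig = reads[0]
for nxt in reads[1:]:
    k = overlap(contig, nxt)
    contig += nxt[k:]                       # extend by the non-overlapping tail
print("assembled contig:", contig)          # ACGTACGGTT
```

Real aligners and assemblers use far cleverer data structures, but the division of labor is the same: alignment leans on a reference, assembly leans on overlaps, and both have to be repeated tens of millions of times.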
The good news is that a new wave of alignment and assembly software solutions has caught up to next-generation sequencing. The Scientist talked to some of the developers. Here's what they said:
USER: René Warren, bioinformatician, BC Cancer Research Centre, Vancouver, British Columbia, Canada
Project: Developing an approach for sequencing all the types of T-cell receptor genes present in blood
Problem: Warren's group needed to capture the portion of the T-cell receptor gene responsible for generating millions of receptor variations in a healthy individual. Because that hot spot is so diverse between individual T-cell receptors—it has more than 10¹⁵ theoretically possible sequences—there is no single reference genome. "Basically you're doing a de novo assembly," Warren says. He needed a way to assemble the diverse genetic region, 12 to 16 nucleotides, from scratch, using short-read data.
Solution: Last summer, the group developed and tested iSSAKE, software that helps them assemble the genomic hot spot. It works by finding certain reads that are part of a known gene segment—the V-gene—that neighbors the hot spot. The strategy is like assembling a jigsaw puzzle of a landscape by picking out the pieces that capture the transition from one part of the image to another, like the border between the grass and the sky. "We segregated all the reads that aligned to the V-genes but had unmatched bases at the end," Warren says. "Presumably these reads would actually capture part of the [neighboring spot of interest], maybe all of it."
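A minimal sketch of that screening step (the V-gene fragment, reads, and anchor length below are invented examples, not the group's actual data): keep reads whose 5' end matches the known V-gene but whose 3' tail does not, and use those tails as seeds into the unknown region.

```python
# Reads whose start matches a known V-gene segment but whose tail does not
# presumably reach into the unknown hypervariable region.
v_gene_end = "TGTGCCAGCAGC"          # hypothetical 3' end of a V-gene segment

def unmatched_tail(read, anchor, min_anchor=8):
    """If the read starts with a suffix of the anchor, return its unmatched tail."""
    for k in range(len(read) - 1, min_anchor - 1, -1):
        if anchor.endswith(read[:k]):
            return read[k:]          # bases spilling over into the region of interest
    return None

reads = ["CCAGCAGCTTAGG", "GCAGCAGCGGAAT", "TTTTTTTTTTTTT"]
seeds = [t for t in (unmatched_tail(r, v_gene_end) for r in reads) if t]
print(seeds)                         # ['TTAGG'] -- seeds for the targeted assembly
```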
The group tested the algorithm on a data simulation of T-cell receptors based on GenBank data and found that for read lengths of 36 nucleotides, the method is more than 90% sensitive and more than 99.9% accurate even for relatively rare T-cell receptor types (Bioinformatics, 25:458–64, 2009).
Considerations: The group is now working with wet-lab data and has found 96% agreement between its computational reconstruction with iSSAKE and a small sample sequenced by traditional Sanger methods. "The challenge that remains is that there are still some errors in short-read data—this might affect the quality of the outcome," Warren says.
Download: ftp://ftp.bcgsc.ca/supplementary/iSSAKE (free)
USER: Nicholas Bergman, assistant professor of biology, Georgia Institute of Technology, Atlanta
Project: Mapping the transcriptome of anthrax-causing bacteria (Bacillus anthracis) and measuring gene expression levels
Problem: Bergman's group uses Applied Biosystems' SOLiD gene sequencer, because it produces more data than do other new platforms. But when they started using the system, the software tools available for analyzing the data were too slow and had a hard time dealing with sequencing errors (mismatches between the short reads and the reference genome). "It would take [an inaccuracy] and say this read is unmappable," Bergman says. The group needed a new algorithm that would tolerate these errors and move through vast amounts of data.
Solution: The group began working on the problem in February of last year, when Bergman's graduate student, Brian Ondov, proposed using an algorithm that performs fast string searches and that is commonly used for detecting plagiarism. Based on that algorithm, they built a search tool that maps SOLiD sequence reads very quickly and allows users to set the error tolerance, thus selecting their own preferences for speed and accuracy.
That tool became SOCS (short oligonucleotide color space), which Bergman has since been using to measure patterns of gene expression (Bioinformatics, 24:2776–77, 2008). The program aligns fragments of bacterial mRNA to a reference genome to see which genes are expressed. "We're basically using this to replace microarrays," Bergman says.
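The counting step behind that microarray replacement is simple once the reads are mapped; a toy version (gene coordinates and mapped positions are invented):

```python
# Estimate expression by counting mapped reads per gene (illustrative numbers only).
genes = {"geneA": (0, 1500), "geneB": (1500, 4000)}    # (start, end) coordinates
mapped_positions = [120, 130, 950, 2100, 2101, 3999]   # start positions of mapped reads

counts = {name: 0 for name in genes}
for pos in mapped_positions:
    for name, (start, end) in genes.items():
        if start <= pos < end:
            counts[name] += 1

print(counts)   # {'geneA': 3, 'geneB': 3} -- a crude microarray-style readout
```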
Considerations: The software is appropriate for functional genomics research using SOLiD, with techniques such as ChIP-Seq and RNA-Seq. "It's completely flexible," Bergman says. "The higher you set your tolerance [for errors], the more you map, but the longer it takes."
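A generic sketch of what that knob controls (a simplified seed-and-check mapper, not the SOCS algorithm itself; the reference, read, and tolerance values are invented):

```python
# Simplified error-tolerant mapping: anchor on an exact seed, then accept
# placements with at most a user-set number of mismatches in the full read.
def map_read(read, reference, max_mismatches=2, seed_len=8):
    seed = read[:seed_len]                      # assumes the first bases are error-free
    hits, start = [], reference.find(seed)
    while start != -1:
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:    # the speed/accuracy knob
                hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return hits

reference = "ACGTTAGCCGATACGTTAGCAGAT"
print(map_read("ACGTTAGCAGAT", reference))      # [(0, 1), (12, 0)]
```

Raising `max_mismatches` keeps more reads, including error-containing ones, at the cost of checking more candidate placements.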
Download: http://socs.biology.gatech.edu/Usage.html (free)
USER: Benjamin Jackson, graduate student in the lab of Srinivas Aluru, Iowa State University, Ames
Project: Assembling large genomes from scratch
Problem: Assembling large genomes—which naturally contain a complex mix of deletions, duplications, and sequence rearrangements—can be quite daunting, especially with short-read data sets. Thus, assembling these genomes de novo takes a boatload of computer power.
Solution: Jackson began to tackle this problem in 2007 with a two-pronged approach.
First, he obtained paired short reads—portions of the genome that are separated by a known, approximate distance. Knowing the distance helped limit the complexity of processing, essentially reducing the number of puzzle pieces for assembly (see the sketch below).
Second, the group used a supercomputer, equipped with 1,024 processors, to handle the large data set. It took 4 months to write and debug software for the computer, but the reward was immediate: They can now assemble the Drosophila melanogaster genome in about an hour. Such a feat would have been impossible with a typical computer, Jackson says (BMC Bioinformatics, 10[Suppl 1]:S14, 2009).
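A toy version of that first trick (all positions and the library distance below are invented): a candidate placement for one read is only kept if its mate lands roughly the expected distance away.

```python
# Use the known pair separation to prune ambiguous placements.
EXPECTED_GAP = 2000     # approximate distance between paired reads in the library
TOLERANCE = 200

def consistent(left_pos, right_pos):
    """Keep a pair of placements only if their spacing fits the library."""
    return abs((right_pos - left_pos) - EXPECTED_GAP) <= TOLERANCE

left_candidates = [1000, 56000]      # possible positions for read 1
right_candidates = [3050, 90000]     # possible positions for its mate

pairs = [(l, r) for l in left_candidates for r in right_candidates if consistent(l, r)]
print(pairs)                         # [(1000, 3050)] -- the ambiguity is resolved
```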
Considerations: Like all de novo short-read assembly methods, the software requires roughly 100× coverage, meaning that, on average, you'll need to sequence the same location in the genome 100 times (before assembly). "The main pro is that it does scale to large genomes," Jackson says. "It's the first [bioinformatics tool] that scales to gigabase-sized genomes like mammals and plants."
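Coverage is simply the number of reads times read length divided by genome size, which shows how quickly the data pile up (the numbers here are illustrative, not from Jackson's runs):

```python
# Back-of-the-envelope coverage estimate: (number of reads * read length) / genome size.
genome_size = 120_000_000       # a Drosophila-scale genome, in bases (approximate)
read_length = 36
num_reads = 350_000_000

coverage = num_reads * read_length / genome_size
print(f"average coverage: {coverage:.0f}x")     # ~105x with these numbers
```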
Download: Not available yet, though Jackson says the group will make the software available to the research community when it's fully developed.
USER: Gunnar Rätsch, group leader, Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
Project: Comparing short reads of spliced RNA with reference DNA sequences to predict where splicing occurs (RNA-seq)
Problem: Several methods designed to align mRNA sequences with genomic DNA don't work well with newer, short-read sequencing techniques or with heavy splicing (i.e., RNA that is much shorter than the DNA or that comes in several isoforms). Rätsch's group wanted an algorithm better suited to the new sequencing platforms.
Solution: In 2007, Rätsch's group began developing a program called QPALMA that incorporates splice-site predictions as well as the quality of each read when aligning it to the genome. (Each base comes with a quality score assigned by Illumina and other platforms.) "When we align the read, we check whether the read is identical to the genomic DNA," Rätsch says. "If there's a mismatch, we take the score into account."
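A toy sketch of the idea (this is not QPALMA's actual scoring model; the read, genome window, and penalty values are invented): a mismatch at a base the sequencer was unsure about costs less than one at a high-confidence base.

```python
# Quality-aware mismatch scoring: low-quality bases are penalized less.
def alignment_score(read, genome_window, qualities, match=1.0, mismatch=-2.0):
    score = 0.0
    for base, ref_base, q in zip(read, genome_window, qualities):
        if base == ref_base:
            score += match
        else:
            score += mismatch * min(q, 40) / 40.0   # scale penalty by base confidence
    return score

read      = "ACGTACGT"
window    = "ACGTACTT"                        # one mismatch, at position 6
qualities = [38, 39, 40, 37, 36, 35, 5, 38]   # ...where the quality score is low
print(alignment_score(read, window, qualities))   # 6.75: the mismatch barely hurts
```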
The software also makes predictions about splice sites, a step that takes roughly 12 hours for 2.5 million reads using a single-processor computer, according to Rätsch's estimates (Bioinformatics, 24:i174–80, 2008). "This step can be easily distributed over a cluster of 20 processors and would then take less than an hour," Rätsch says.
Considerations: Right now, the software works well for relatively short, small genomes, says Rätsch. "The difficulty is when the introns are extremely long, that's when QPALMA can get inefficient." The group is working on a new version of the software that will help address this, to be completed, Rätsch hopes, within two months.
Download: http://www.fml.mpg.de/raetsch/projects/qpalma (free)
USER: Cole Trapnell, graduate student in the labs of Steven Salzberg and Lior Pachter, University of Maryland, College Park
Project: Aligning short RNA fragments to cover entire genomes (RNA-seq for large genomes)
Problem: Doing RNA-seq is difficult because genes can be spliced in multiple ways, and it's tough to predict where splicing occurs. To add to the challenge, some genes aren't always highly expressed. "When you have light coverage, it's much harder to align the reads across the splice junction," Trapnell says. "If your goal is to discover all the genes right down to the nucleotide level, and the genes are not highly expressed, you'll only see parts of them sometimes."
Solution: Last summer, Trapnell helped graduate student Ben Langmead create Bowtie, a sequence alignment algorithm that compresses genome data and allows users to work with large data sets using a commonly available desktop computer.
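Bowtie's compression rests on the Burrows-Wheeler transform; here is the textbook transform on a tiny string (just the transform itself, not Bowtie's index; the "$" end marker is the usual convention):

```python
# Burrows-Wheeler transform: sort all rotations of the string, take the last column.
def bwt(text):
    text += "$"                                   # unique end-of-string marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACAACG"))   # GC$AAAC -- like characters cluster together, which is
                       # what makes the transform both compressible and searchable
```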
To deal with the unique challenges of RNA-seq, Trapnell added a layer of software to Bowtie. The new software, called TopHat, works by first mapping as many RNA fragments as possible to the reference genome, creating a skeleton for alignment. The skeleton corresponds roughly to the expressed genes in a genome, and TopHat uses the unmappable reads to discover how those genes are spliced (Bioinformatics Epub, March 16, 2009).
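A schematic of that two-pass idea (the helper below is a stand-in for a real aligner, and the junction test is deliberately crude; none of this is TopHat's internal code):

```python
# Pass 1: map reads that fit the genome contiguously (the expressed "skeleton").
# Pass 2: split the leftover reads and look for the two halves landing apart,
#         which marks a putative splice junction.
def align_contiguously(read, genome):
    """Stand-in for a Bowtie-style aligner; returns a position or None."""
    pos = genome.find(read)
    return pos if pos != -1 else None

def run_two_pass(reads, genome):
    islands, leftovers = [], []
    for read in reads:
        pos = align_contiguously(read, genome)
        (islands if pos is not None else leftovers).append((read, pos))
    junctions = []
    for read, _ in leftovers:
        left, right = read[:len(read) // 2], read[len(read) // 2:]
        lpos, rpos = align_contiguously(left, genome), align_contiguously(right, genome)
        if lpos is not None and rpos is not None and rpos > lpos + len(left):
            junctions.append((lpos + len(left), rpos))   # (donor end, acceptor start)
    return islands, junctions

genome = "AAATTTGGGCCCAAACCCGGGTTTAAA"
reads = ["TTTGGG", "GGGCCCGGGTTT"]       # the second read spans a fake "intron"
print(run_two_pass(reads, genome))       # ([('TTTGGG', 3)], [(12, 18)])
```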
Considerations: The program "works best when you have a decent reference genome assembly," Trapnell says. "If you have no assembly at all you can't use the program at all." Also, he adds, as with any new software, it makes sense to use multiple tools and check your results.
Although the group did not directly compare TopHat to QPALMA (see above), there are key differences, Trapnell says. "TopHat is designed to be very fast and is thus usable on large datasets and large genomes, where QPALMA is designed to be very sensitive, but not as fast."
Download: http://tophat.cbcb.umd.edu (free)