You've unpacked your next-generation sequencing system and popped in some DNA or RNA. Five days later, you've sequenced 50 million tiny strings of nucleotides. Then what?
You have to align all the fragments, called "reads," by sequence against a reference genome, a fully assembled sequence from the same species. In the absence of a reference, you're left to assemble the genome based solely on the portions of the reads that overlap with one another. For both alignment and assembly, "computation becomes a big issue," says Steven Salzberg, director of the University of Maryland's Center for Bioinformatics and Computational Biology. "That's a huge amount of data, and in fact even streaming the data off the machine onto other computers causes network bandwidth problems."
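To make the two tasks concrete, here is a toy Python sketch. It is an illustration under simplifying assumptions, not how any production tool works: the function names `align_read` and `overlap` are hypothetical, the sequences are invented, and only exact matches are considered, whereas real software has to tolerate sequencing errors and cope with millions of reads.

```python
def align_read(read, reference):
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further along so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_length` bases exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Invented toy data, for illustration only.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "GTTA", "TAGC"]

# Alignment: look up each read in the reference.
hits = {read: align_read(read, reference) for read in reads}

# Assembly without a reference: measure how far one read runs into the next.
shared = overlap("ACGTT", "GTTAG")
```

Note that the read `ACGT` matches the toy reference in two places. Short reads are more likely to fit a genome at several positions, which is one reason they are harder to place than long ones.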
That's because most newer technologies generate shorter reads—roughly 25 to 50 nucleotides in length—than those generated using traditional Sanger sequencing. The newer methods create smaller ...