Unless you’ve been hiding under a rock for the past few years, you know next-generation DNA sequencing is all the rage. The technique has gone from gee-whiz to practically routine in the five years since sequencing company 454 Life Sciences jump-started the revolution. In the past year alone, next-gen sequencers powered major strides in the 1000 Genomes Project and the Human Microbiome Project; identified the genes underlying Mendelian disorders like Joubert Syndrome and Miller Syndrome; and cracked the genomes of the apple, the body louse, and the cholera strain now ravaging Haiti.
“In some respects, the potential of [next-gen sequencing] is akin to the early days of PCR, with one’s imagination being the primary limitation to its use,” Michael Metzker, senior manager at the Human Genome Sequencing Center at the Baylor College of Medicine, wrote in early 2010 (Nat Rev Genet, 11:31-46, 2010).
The rule of thumb in the genomics community is that every dollar spent on sequencing hardware must be matched by a comparable investment in informatics. Researchers at sequencing powerhouses like the Broad Institute of Harvard and MIT and the Wellcome Trust Sanger Institute in the UK might be relatively insulated from such concerns. But for the average researcher, those extra dollars—and the infrastructure and expertise they buy—are often out of reach.
Fortunately, there is a diverse array of free and low-cost tools and an active user community to help. The question for most new users is: where to start? The Scientist asked next-gen sequencing experts to weigh in on the most common newbie questions. Here’s what they said.
What IT infrastructure do I need?
The short answer: it depends. Sequencing datasets can be enormous, but not all datasets are equal. Whole-human-genome–sequencing projects, including the raw sequences, alignments, and variant calls, can run into the hundreds of gigabytes per sample, says David Dooling, assistant director of informatics at the Genome Center at Washington University in Saint Louis; ChIP-Seq datasets (i.e., the output of chromatin immunoprecipitation experiments) are much smaller, probably just a few gigabytes in size.
So, the answer to how much space you need to hold all that data is also: it depends. The Center for Biomarker Research and Personalized Medicine at Virginia Commonwealth University has one Applied Biosystems SOLiD 4 sequencer, which it acquired in early 2010. Edwin van den Oord, the Center’s director, says the facility has some 35 terabytes (35,000 gigabytes) of disk space to store its data, some of which is collected in-house, but most of which is actually outsourced as part of a massive effort to sequence the methylomes of 1,575 individuals. “For just the data we are generating in-house, we wouldn’t need that much storage space,” he says. But, he adds, even 35 TB isn’t enough: “We will need to purchase more disk space on the cluster to be able to analyze all the data.” Shianna’s facility at Duke, which has sequenced some 200 whole human genomes and another 100 exomes (that is, the protein-coding regions alone), has 300 TB of disk space now, “and much of it is already full.”
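As a back-of-the-envelope illustration of why "it depends," the sketch below multiplies sample count by per-sample size. The per-sample figures (300 GB for a whole genome, 3 GB for a ChIP-Seq run) are assumptions chosen to fall within the ranges quoted above, and the `headroom` multiplier for intermediate files is hypothetical, not a number from any of the facilities mentioned.

```python
# Rough storage estimate for a sequencing project (illustrative numbers only).

GB_PER_TB = 1000  # the article equates 35 TB with 35,000 GB


def storage_needed_tb(samples: int, gb_per_sample: float, headroom: float = 1.5) -> float:
    """Return an estimated disk requirement in terabytes.

    `headroom` is a hypothetical multiplier covering intermediate files
    and re-analysis; adjust it to match your own pipeline.
    """
    if samples < 0 or gb_per_sample < 0 or headroom < 1:
        raise ValueError("samples and sizes must be non-negative; headroom must be >= 1")
    return samples * gb_per_sample * headroom / GB_PER_TB


# Assumed per-sample sizes: 300 GB per whole genome, 3 GB per ChIP-Seq dataset.
whole_genomes_tb = storage_needed_tb(samples=200, gb_per_sample=300)
chip_seq_tb = storage_needed_tb(samples=200, gb_per_sample=3)

print(f"200 whole genomes: about {whole_genomes_tb:.1f} TB")
print(f"200 ChIP-Seq runs: about {chip_seq_tb:.1f} TB")
```

Under these assumptions, the same number of samples differs a hundredfold in disk footprint depending on the experiment type.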
The other key ingredient, besides storage space, is computing power. These files are so large, they often cannot reasonably be analyzed using a desktop computer; instead, you need access to a cluster—a kind of ad-hoc supercomputer built by networking a number of smaller computers in parallel. Sequence Variant Analyzer, for instance—a software tool developed at Duke that annotates identified variants and displays them in their genomic context—is “a bit of a memory monster,” says Shianna. “On the low end it needs 24 to 32 gigabytes of memory.”
What if I don’t have that infrastructure?
Many universities offer cluster resources, but not all do. For researchers without cluster access, several Web- and cloud-based alternatives exist. (See “Harnessing the Cloud,” The Scientist, October 2010.) One example: Amazon Web Services. AWS offers several basic services, including the “elastic compute cloud” (EC2), which has nearly limitless computing infrastructure, and the “simple storage service” (S3), effectively a massive hard disk. Anyone can establish an account with AWS, requisition a virtual machine—a computer interface that they access via the Web, but whose hardware resides in Amazon’s cloud—log in, install the tools they need, and start analyzing data.
Such pay-to-play systems offer tremendous flexibility; by letting Amazon (or other cloud service providers, such as Google and Microsoft) do the heavy lifting on computationally intensive operations, cloud-based solutions free users—for a fee—from worrying about purchasing, maintaining, and upgrading their IT infrastructure. “The latest estimates that I’ve seen are that Amazon probably has on the order of several hundred thousand CPUs and something on the order of hundreds of petabytes [1 petabyte equals 1000 terabytes] of disk storage available,” says Andreas Sundquist, CEO and cofounder of Palo Alto–based DNAnexus. “The number of places in the world that have access to that amount of computer and disk storage is very small.” Some civic-minded researchers have developed a preconfigured bioinformatics-based virtual Linux machine as an “Amazon machine image” (one that can be booted into EC2 in place of the stock Amazon system); it is available at www.cloudbiolinux.com.
Another option is Penn State’s Galaxy (galaxy.psu.edu/). According to its Web page, “Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more.” The system includes extensive documentation and tutorial videos. Mark Adams, who runs the sequencing core at Case Western Reserve University, calls Galaxy “a wonderful system for integrating different kinds of data and querying across them, particularly in a very coordinate-based way.”
For those who’d like a more polished cloud-computing experience, there are commercial options, including DNAnexus (dnanexus.com) and GenomeQuest (www.genomequest.com). DNAnexus accepts data via direct upload or on the fly from a net-connected sequencer, enabling such analyses as variant finding, RNA expression analysis, and ChIP-Seq. “You don’t have to think about where that analysis is going to run; you don’t have to think about where the results are going to be stored. All that is managed for you in the cloud,” Sundquist says. The service (built on AWS) costs $20/gigabase/2 years for academics, and $5/gigabase for sequencing facilities that use the system to share data with their users.
What programs should I use?
Again, it depends. What analyses are you trying to perform? How comfortable are you with UNIX? And, can you program?
Literally hundreds of bioinformatics tools for next-gen sequencing are available, from polished commercial products to rough-around-the-edges freeware solutions; more than 360 are listed and described on the excellent software wiki available at SeqAnswers.com (seqanswers.com/wiki/Software). (See a table of popular no-cost options at the end of the article.)
Unfortunately for newbies, few of these tools hide their internals behind pretty graphical user interfaces. “There’s a lot of free, pretty good software available for doing analysis, but virtually all of that free software and cutting-edge software is UNIX command line–based,” says Adams. For the most part, these programs are essentially data filters and file converters: they take data of one format, process it, and output it in another format. For simplicity, most genome centers compose do-it-yourself bits of code to guide the raw sequence data through these steps, funneling the output of one program into another to clean it up, collect quality metrics, align it to a reference genome, and so on.
Such a software “pipeline” may sound overly elaborate, but when it comes to filtering data files containing literally millions of records, there really is no other choice. Thus, it pays to have at least one person with decent UNIX skills on your team. “Basic UNIX command-line syntax will get you so far with this type of data,” says Daniel MacArthur, a postdoc at the Wellcome Trust Sanger Institute who blogs and tweets (@dgmacarthur) extensively on genomics issues.
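To make the pipeline idea concrete, here is a minimal sketch in Python of three chained stages: parse, filter, convert. It is not any genome center's actual code. It assumes four-line FASTQ records and Phred+33 quality encoding, and the function names and the quality cutoff of 20 are hypothetical choices for illustration.

```python
import io
from typing import Iterable, Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records; stops at EOF or a blank header."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield Read(header[1:], sequence, quality)


def mean_quality(read: Read, offset: int = 33) -> float:
    """Average base quality, assuming Phred+33 encoding."""
    return sum(ord(char) - offset for char in read.quality) / len(read.quality)


def drop_low_quality(reads: Iterable[Read], min_mean: float = 20.0) -> Iterator[Read]:
    """Filter stage: pass along only reads at or above the (assumed) cutoff."""
    for read in reads:
        if read.quality and mean_quality(read) >= min_mean:
            yield read


def to_fasta(reads: Iterable[Read]) -> Iterator[str]:
    """Converter stage: emit each read as a FASTA record."""
    for read in reads:
        yield f">{read.name}\n{read.sequence}"


# Two toy reads: 'I' encodes quality 40, '!' encodes quality 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n")

# Each stage consumes the previous stage's output, one record at a time.
for record in to_fasta(drop_low_quality(parse_fastq(example))):
    print(record)
```

Because each stage is a generator, records stream through one at a time, so the whole file never has to fit in memory. That is the same property that makes chained command-line filters practical on files with millions of records.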
How do I view the raw data?
Generally speaking, you don’t. And you don’t need to. There’s so much of it, there’s little to be gained from the exercise; instead you’ll usually view the processed data—lists of SNP calls and the like. But there are significant exceptions, says MacArthur. In particular, he says, it’s often worth it to closely examine the actual sequence reads supporting variant calls before diving into validation studies.
“That’s the single biggest piece of advice that I tend to give to people just starting out with these types of analyses—take every opportunity to actually look at the data in as many different ways as possible, because you can get fooled,” MacArthur says. Single-nucleotide variant calling, for instance, is relatively robust. But insertions and deletions (or “indels”) are famously problematic: some indel reads are discarded because they don’t seem to align properly to the reference; others are called as clusters of SNPs instead. “That is something that, as soon as you look at the reads you can see that there’s something really wrong there,” he says.
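One simple way to decide which calls deserve a closer look is to flag SNPs that bunch together, since a tight cluster may be a miscalled indel. The sketch below is a hypothetical heuristic, not a method described by MacArthur. The 10-base window and three-SNP minimum are arbitrary assumptions, and variants are represented as plain (chromosome, position) pairs rather than any real file format.

```python
from typing import List, Tuple

Variant = Tuple[str, int]  # (chromosome, 1-based position)


def flag_snp_clusters(
    snps: List[Variant], window: int = 10, min_snps: int = 3
) -> List[List[Variant]]:
    """Group SNP calls lying within `window` bases of the previous call.

    Returns only groups with at least `min_snps` members; these are
    candidates for manual inspection in a genome browser.
    """
    clusters: List[List[Variant]] = []
    current: List[Variant] = []
    for chrom, pos in sorted(snps):
        same_run = (
            bool(current)
            and chrom == current[-1][0]
            and pos - current[-1][1] <= window
        )
        if same_run:
            current.append((chrom, pos))
        else:
            if len(current) >= min_snps:
                clusters.append(current)
            current = [(chrom, pos)]
    if len(current) >= min_snps:
        clusters.append(current)
    return clusters


calls = [("chr1", 100), ("chr1", 103), ("chr1", 107), ("chr1", 5000), ("chr2", 105)]
for cluster in flag_snp_clusters(calls):
    chrom = cluster[0][0]
    print(f"Inspect {chrom}:{cluster[0][1]}-{cluster[-1][1]} ({len(cluster)} SNPs)")
```

A flag here is only a prompt to look at the reads; it does not by itself show that the calls are wrong.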
You can inspect the aligned reads using a genome browser like the Integrative Genomics Viewer (IGV), which displays overlapping reads as a “pileup” on the reference genome to which they were aligned. “IGV is just a great, intuitive, easy-to-use tool,” MacArthur says.
To view the raw data—the actual text file output from the sequencing instruments themselves—you can use UNIX command line tools like head and more to decide whether your data is formatted correctly for various analysis programs.
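The same quick check can be scripted. The sketch below mimics `head` in Python and then tests whether the first few lines look like FASTQ. FASTQ is an assumption here, since the article does not name a format, and `looks_like_fastq` is a hypothetical helper that checks structure only, not the validity of the bases or quality characters.

```python
import io
from itertools import islice
from typing import List, TextIO


def head(handle: TextIO, n: int = 8) -> List[str]:
    """Return the first n lines of a file, like the UNIX `head` command."""
    return [line.rstrip("\n") for line in islice(handle, n)]


def looks_like_fastq(lines: List[str]) -> bool:
    """Check the four-line record structure of the lines provided.

    Pass a multiple of four lines; a partial record counts as malformed.
    """
    if not lines or len(lines) % 4 != 0:
        return False
    for start in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            return False
        if len(sequence) != len(quality):
            return False
    return True


# Stand-in for an instrument output file; with real data you would use
# open() on your own file path instead of io.StringIO.
sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
first_lines = head(sample, n=8)
print("\n".join(first_lines))
print("Looks like FASTQ:", looks_like_fastq(first_lines))
```

Reading only the first few lines keeps the check fast even on very large files.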
Where can I go for help?
Fortunately for a discipline as complex and rapidly changing as next-gen bioinformatics, there’s no shortage of help available, whether as user groups or mailing lists, online forums or Web tutorials. Tool developers will generally respond to e-mail queries, notes Dooling, as will other more experienced researchers. One good starting point: the popular SeqAnswers.com, with some 6,400-plus active members.
“There’s a lot of people out there and a lot of expertise, so don’t be a hero, don’t try to do it all yourself,” Dooling says. “Leverage the knowledge around you. There’s no sense in reinventing the wheel.”
SELECTED FREEWARE SEQUENCE ANALYSIS TOOLS
| Tool | URL | Function |
|---|---|---|
| Dindel | http://sites.google.com/site/keesalbers/soft/dindel | small indel discovery |
| Erds | http://www.duke.edu/~mz34/erds.htm | copy number variant discovery |
| Pindel | http://www.ebi.ac.uk/~kye/pindel/ | small indel discovery |
| Samtools | http://samtools.sourceforge.net | tools for manipulating aligned data (including SNP finding) |
| Sequence Variant Analyzer | http://www.svaproject.org | displays variants in their genomic context |
| Cufflinks | http://cufflinks.cbcb.umd.edu | measures transcript abundance |
| De Novo Assembly | | |
| Oases | http://www.ebi.ac.uk/~zerbino/oases/ | assembly from transcriptome data |
| Integrated Genome Browser | http://www.bioviz.org/igb/ | genome browser |
| Integrative Genomics Viewer | http://www.broadinstitute.org/software/igv/ | genome browser |