Harnessing the cloud
Free platforms for powering your genomics and proteomics data analysis using processing power rented in the sky
No longer must biologists covet the computer clusters of their colleagues. Cloud computing—computing power accessed over the Internet—lets biologists without their own computer cluster store and analyze floods of data. The main space for rent is on Amazon Web Services (AWS), where, for less than $1 per hour, researchers can access powerful processors only when they need them.
Integrating bioinformatics programs with the cloud requires considerable computer expertise. Luckily, developers are creating free and easy-to-operate platforms for genomics and proteomics. After paying for space on the cloud, researchers can use these programs to analyze data as simply as they’d plan a vacation online.
Cloud computing is relatively new, so its drawbacks have yet to fully surface. Some developers voice concern about Amazon’s virtual monopoly on rentable cloud space. And for biologists, the first obstacle may simply be learning which applications best suit their needs.
Hypothetical concerns aside, Deepak Singh, a business development manager at Amazon Web Services who works with scientists interested in data analytics using the cloud, predicts that it will enable the increasingly crucial role of bioinformatics in the life sciences. “Sure, there are facilities with their own computer infrastructure,” Singh says, “but if they had to do it again today, they’d think about it differently.”
The Scientist talked to developers of some emerging free programs that use the cloud to help solve some basic bioinformatics problems biologists face in analyzing genomics and proteomics data. Here’s what they said.
Problem: Genome-wide association studies and microarray experiments require computationally intense statistical analyses. But few biologists have the computer resources or programming skills required, limiting who can collaborate on such analyses.
Project: Elastic-R (www.elastic-r.org)
Developer: Karim Chine, Director of Cloud Era Ltd, Cambridge, United Kingdom
Solution: Chine created Elastic-R for biologists to run complicated statistical analyses on vast datasets. The platform lets researchers run a variety of statistical software packages from a laptop, iPhone, or any device with a Web browser, with the computation itself happening on Amazon’s cloud. What’s more, users can grant others access to their “virtual workbench” so that collaborators in different places can manipulate the data together in real time. Chine compares the collaborative aspect of the program to Google docs. “Everything you do with the data,” he says, “right down to entering in numbers onto Excel, can be done on a browser that you, in New York, and I, in Boston, can see.”
Pro: Users can take snapshots of their analysis on the cloud, which includes the environment required to process the data, the spreadsheets, and all of the interfaces created to explore the data. Snapshots can be shared or downloaded to local machines. “Let’s say the reviewers of a paper want to know how we did our analysis—we could just give them a label identifying a snapshot,” Chine explains. “The reviewer can then manipulate our data using our specific software tools from within a standard Web browser to check our results.”
Con: At the moment, Elastic-R incorporates various open-source programs because paying for their use isn’t an issue. Although Elastic-R could in principle incorporate proprietary statistical software such as MATLAB and SAS as well, it can’t until companies devise a way to let people pay for the software on a need-to-use basis in cloud environments. Traditionally, the companies that own proprietary software charge for a license to use it; when the software runs in the cloud, however, it might be used only briefly by certain collaborators on a project, complicating payment schemes. Companies with licensed programs like SAS and MATLAB should adopt a pay-per-use model that could be implemented on Amazon’s cloud, says Chine.
Problem: Publicly available gene- and protein-related applications and databases such as HapMap, GeneAtlas, RefSeq, and GeneWiki don’t pool all of the information available about any given gene that a researcher might identify in a genome sequence, a genome-wide association study, or a microarray analysis.
Project: BioGPS (biogps.gnf.org)
Developer: Andrew Su, associate director of bioinformatics, Genomics Institute of the Novartis Research Foundation, San Diego
Solution: Su developed BioGPS to aggregate such information and allow users to dictate what parts of each database or analysis tool are most useful to them. Users customize their own gene annotation environment as they would their Google homepage—except that instead of displaying updates on weather, celebrities, and the local news, users can choose to see 3-D crystal structures, alternative splicing information, related reagents, or gene expression data. More than 300 applications (called plug-ins) are currently available on BioGPS, ranging from the annotation tool Ensembl to the gene expression database GeneAtlas. And with the same type of social networking features that allow Amazon.com to recommend books to customers, BioGPS users can learn what the most popular plug-ins are and which users pay attention to the same genes they do. Su says the social features reduce duplicated efforts and enable discovery. “You gain serendipity,” he says.
Pro: Because BioGPS improves as more people join, Su strives to keep it simple enough to entice a community beyond computer-savvy bioinformaticians.
Con: Plug-ins provide data to users in an unstructured format, which allows data providers with basic programming skills to contribute to BioGPS. Yet as a result, this unstructured data isn’t easily formatted for further bioinformatics analyses. In other words, BioGPS plug-in architecture is amenable to aggregation, but not integration.
Problem: Asking multitiered questions that require data integration from various programs is computationally intensive when datasets get large. For example, to find SNPs and then determine which genes in a sequence contain the most SNPs, one needs to compare all human exons (roughly 357,000) to all human SNPs (roughly 12,350,000).
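The exon/SNP question above is, at heart, a huge interval-overlap join. A toy sketch of that join—with made-up coordinates and names, nothing from a real genome—shows why the full-scale version (roughly 357,000 exons against 12.35 million SNPs) gets farmed out to a cluster:

```python
# Toy sketch of the exon/SNP comparison: for each exon interval, count how
# many SNP positions fall inside it. All coordinates here are invented.
from bisect import bisect_left, bisect_right

def snps_per_exon(exons, snp_positions):
    """exons: list of (name, start, end); snp_positions: sorted list of ints."""
    counts = {}
    for name, start, end in exons:
        # Binary search keeps each exon's lookup at O(log n)
        # instead of scanning every SNP.
        counts[name] = bisect_right(snp_positions, end) - bisect_left(snp_positions, start)
    return counts

exons = [("EXON_A", 100, 200), ("EXON_B", 150, 400), ("EXON_C", 1000, 1200)]
snps = sorted([120, 180, 350, 1100, 5000])
print(snps_per_exon(exons, snps))  # {'EXON_A': 2, 'EXON_B': 2, 'EXON_C': 1}
```

Even with efficient lookups, repeating this for millions of intervals across many analysis layers is what makes the rented cluster attractive.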
Project: Galaxy (g2.bx.psu.edu)
Developer: James Taylor, assistant professor, Departments of Biology and of Mathematics and Computer Science, Emory University, Atlanta
Solution: Taylor’s team created Galaxy, a pipeline for analyzing sequencing data from start to finish on one online data analysis platform. It uses the information from various genomic databases and algorithms from existing bioinformatics programs to run analyses on multiprocessor computer clusters in the cloud. Galaxy lets users start from raw data, derive information from that data, and compare or integrate it with other data sources—all without leaving a Web browser. For example, a user can import raw sequence data from a ChIP-seq experiment, align sequences to a reference genome with the program Bowtie, identify regions enriched for protein-DNA binding with the program MACS, and compare the enriched regions to annotations from a database like Ensembl.
Pro: The variety of pipelines created and used by some 10,000 registered Galaxy users is a testament to the usability of the platform, says Taylor.
Con: While storing every dataset is possible with Galaxy, the many layers in these analyses require a lot of storage space. Using Amazon’s cloud for all of this storage can come with a hefty price tag.
Problem: Aligning genomes and detecting SNPs becomes incredibly complicated as the number of DNA sequences available for comparison increases.
Project: Crossbow (bowtie-bio.sourceforge.net/crossbow/index.shtml)
Developer: Steven Salzberg, Director of the Center for Bioinformatics and Computational Biology, University of Maryland
Solution: Salzberg and his team developed Crossbow for whole-genome analysis in the cloud. Crossbow combines two programs designed to handle the mountains of data generated by next-generation sequencing: the alignment program Bowtie and the SNP detector SOAPsnp. Unlike Galaxy, this program is made for mapping billions of reads and then computing SNPs, says Salzberg.
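The division of labor Crossbow exploits—align reads first, then call variants from the resulting pileup—can be sketched in miniature. This is a hypothetical illustration only: the reference, reads, and calling rule below are invented, and the real pipeline distributes both phases over a cluster rather than running them in one process.

```python
# Much-simplified sketch of the Crossbow pattern: an alignment step (Bowtie's
# role) places each read base at a reference position, and a calling step
# (SOAPsnp's role) reports a SNP wherever the pileup consensus disagrees with
# the reference. Everything here is illustrative, not the tools' actual code.
from collections import Counter, defaultdict

REF = "ACGTACGT"  # invented 8-base reference

def map_reads(reads):
    """reads: list of (alignment_start, sequence). Build a per-position pileup."""
    pileup = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset].append(base)
    return pileup

def reduce_snps(pileup, min_depth=2):
    """Call a SNP where the majority base differs from the reference."""
    snps = {}
    for pos, bases in pileup.items():
        if len(bases) < min_depth:
            continue  # too little coverage to call confidently
        consensus, _ = Counter(bases).most_common(1)[0]
        if consensus != REF[pos]:
            snps[pos] = (REF[pos], consensus)
    return snps

reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC")]  # all three carry A at position 3
print(reduce_snps(map_reads(reads)))  # {3: ('T', 'A')}
```

The two phases fit the map/reduce mold naturally, which is what lets Crossbow spread billions of reads across hundreds of rented cores.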
Pro: Analysis on speed. Using Crossbow, Salzberg’s team analyzed more than 35-fold coverage of a human genome in 3 hours for about $85 using a 40-node, 320-core cluster on Amazon’s cloud.
Con: Researchers must have access to a high-speed connection from the outset, because uploading large human datasets onto the cloud requires sufficient bandwidth. “If I gave you 300 gigabytes of data, I doubt you could upload that dataset at most universities,” says Salzberg.
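A back-of-envelope calculation shows why Salzberg's 300-gigabyte example is daunting. The link speeds below are illustrative assumptions, not figures from the article, and overhead is ignored:

```python
# Time to push a dataset over a sustained link; speeds are assumed examples.
def upload_hours(gigabytes, megabits_per_second):
    bits = gigabytes * 1e9 * 8          # convert size to bits
    return bits / (megabits_per_second * 1e6) / 3600

for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbps: {upload_hours(300, mbps):6.1f} hours")
```

At an assumed 10 Mbps the transfer takes nearly three days; even a sustained 100 Mbps link needs the better part of a workday, which is why the upload is the rate-limiting step.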
Problem: It’s not just gene-related experiments that are producing a flood of data—proteomics has blossomed, too, as new and improved mass spectrometers generate mountains of raw data comprising mixtures of thousands of proteins. Researchers identify proteins in the mix by using powerful computers to compare the sample against thousands of existing entries in a database. “It’s hard to own the computers fast enough to process the data,” explains developer Andy Greene, “and hard to spend time and money to maintain them once the analysis is over.”
Project: Virtual Proteomics Data Analysis Cluster (proteomics.mcw.edu/vipdac)
Developer: Andy Greene, Director of the Biotechnology and Bioengineering Center, Medical College of Wisconsin
Solution: Greene developed ViPDAC for cloud-based analyses on large proteomics datasets. With ViPDAC, multiple analyses can run simultaneously, so that users can try things they normally wouldn’t have tried before because of a lack of computer resources, Greene says. For example, a user can test how changing the parameters in parallel analyses alters the output of a protein search. ViPDAC gives users estimates of how much time and money an analysis will take so that they can determine how many processors they’d like to rent. “If there’s a conference in 2 weeks, you can opt for the ViPDAC analysis that will take 1 week, use 10 processors, and cost 75 cents,” says Greene. “Or if you need the data for a lab meeting today, you might decide to pay the $250 for 5,000 processors because it’s on your PI’s card.”
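The trade-off Greene describes—same job, more processors, less waiting—can be sketched with a hypothetical version of such a time-and-cost preview. The CPU-hour total and per-node price below are assumptions for illustration (not ViPDAC's actual formula or AWS's actual rates), and the job is assumed to parallelize cleanly:

```python
# Hypothetical time/cost preview: given total CPU-hours of work and a per-node
# hourly price (both assumed, not from the article), estimate wall-clock time
# and rental cost for different cluster sizes.
def estimate(cpu_hours, nodes, price_per_node_hour=0.70):
    wall_hours = cpu_hours / nodes              # perfect scaling assumed
    cost = nodes * wall_hours * price_per_node_hour
    return wall_hours, cost

for nodes in (1, 10, 40):
    hours, cost = estimate(cpu_hours=120, nodes=nodes)
    print(f"{nodes:>3} nodes: {hours:6.1f} h, ${cost:.2f}")
```

Under perfect scaling the rental cost stays the same no matter how many nodes share the work—only the wall-clock time shrinks—which is precisely the elasticity that makes "pay $250 and have it today" a rational option.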
Pro: Greene’s team compared a search for proteases in a sample of rat urine using a local cluster and using the cloud with ViPDAC. From their results, they extrapolated that searching the entire sample on a desktop computer would take 140 days, as compared to 5.7 days using a 20-node ViPDAC cluster.
Con: Uploading data onto the cloud is the rate-limiting step. If the connection between an institution and the cloud is slow, it can take hours to upload data from a local server onto the cloud.