As next-generation sequencing gets ever cheaper and higher-throughput, data file size continues to surge, creating some new, pressing needs for scientists. It’s not enough to be able to acquire big data using their own machines; researchers have to be able to store it, move it, and analyze it, and they often want to share it. Large collaborations complicate these steps. As a result, many researchers have resorted to planning their workflows around having a single site for analyses; it’s that, or physically shipping hard drives.
Not only are data files growing in size and number, especially those amassing sequence data, but data handling in genomics, epidemiology, and other fields has become unwieldy in other ways. Copying thousands of files, or sharing them with others, has become a laborious process, and as analysis options proliferate, choosing the right tools for the job can take some guesswork. Figuring out how to make data easy to handle and process is a big challenge for life scientists, according to Stan Ahalt, director of the Renaissance Computing Institute and a professor of computer science at the University of North Carolina at Chapel Hill. “The other challenge is learning how to utilize other people’s data to accelerate their own lab’s science,” he says.
Data demands in genomics and other omics projects have helped drive the development of new and improved platforms for data analysis, sharing, pooling, and transfer. The Scientist profiles four such advances that can help with sequencing projects and more.
Pooled data analysis
Problem: A multisite international project, aimed at investigating the epidemiology of autism through population-based health records, needed a method for pooling harmonized data sets. The concern was avoiding the legal and ethical challenges associated with sharing patient data across national borders.
Solution: Using open-source tools, Kim Carter and Richard Francis of the Telethon Institute for Child Health Research in Perth, Australia, created a platform for so-called data federation, in which individual researchers keep control of their own data but can also—through a web-based interface—analyze data pooled among collaborators as if it were their own.
In reality, the researchers using the new tool—dubbed ViPAR for virtual pooling and analysis of research data—are pulling smaller bits of information from each data set for analysis. When they’re finished with an analysis run, the underlying data are wiped from the system’s memory, which helps address privacy issues. Data fed into the system can also be stripped of participant identifiers. Carter says ViPAR is applicable to projects in traditional and genetic epidemiology as well as targeted studies in genetics and genomics.
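The federation idea described above can be illustrated with a minimal Python sketch. It is not ViPAR's actual code: the `Site` class, the `pooled_mean` function, the variable names, and the sample values are all hypothetical, made up for illustration. The sketch assumes each site releases only the variables requested for a run, and that the pooled copy is discarded as soon as the result is computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Site:
    """One collaborating site that keeps control of its own records (hypothetical)."""
    name: str
    records: List[Dict[str, object]]

    def extract(self, variables: List[str]) -> List[Dict[str, object]]:
        # Only the requested variables leave the site; other fields,
        # such as participant identifiers, stay behind.
        return [{v: row[v] for v in variables} for row in self.records]


def pooled_mean(sites: List[Site], variable: str) -> float:
    """Pool one variable across sites in memory, analyze it, then wipe it."""
    pooled: List[float] = []
    try:
        for site in sites:
            pooled.extend(float(row[variable]) for row in site.extract([variable]))
        return sum(pooled) / len(pooled)
    finally:
        # Discard the pooled copy once the analysis run is finished.
        pooled.clear()


# Made-up example records, not real study data.
sites = [
    Site("site_a", [{"participant_id": "A1", "age_at_diagnosis": 3.0},
                    {"participant_id": "A2", "age_at_diagnosis": 5.0}]),
    Site("site_b", [{"participant_id": "B1", "age_at_diagnosis": 4.0}]),
]
print(pooled_mean(sites, "age_at_diagnosis"))  # 4.0
```

The analyst sees a single pooled result, while each site's full records, including the identifiers, never leave the site.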
Getting started: The duo plans to launch a more user-friendly version of the site this fall. It will include detailed instructions and cases to help scientists try out the platform or customize it for their particular problem. But the more complex the project, the more likely users are to need help from bioinformaticians and their local information technology staff in customizing the program, Carter says.
Tip: Data analysis through ViPAR does still involve some data transfer because users must route the relevant bits of data to a central server for a given analysis run. Although the group is working to expand the system’s capacity for the amount of data transferred, the upcoming version can’t handle early-stage next-generation sequencing analyses, such as alignment. “[But] once you’ve nailed down your set of variables, or come up with putative SNPs you’re interested in, it’s at that point when the system could be really powerful,” Carter says.
Cost: The software platform itself is free but relies on all sites that share data having their own physical or virtual server. (It’s possible to use ViPAR if you don’t have a server, but you’ll have to have another site host your data.) A central site needs an analysis server, which should have enough memory and processing cores to handle the amount of data generated in a single analysis run of pooled data. A server for a project similar in scale to the autism project, called iCARE, costs about $5K–$10K, says Carter.
Data storage, sharing, transfer
Problem: Several years ago, the free cyberinfrastructure available through the National Science Foundation (NSF), which gave scientists a way to store, manage, and share project data, was geared more for astrophysicists than it was for biologists. Using these resources required intense training to meet even simple needs such as data storage, says Nirav Merchant, director of information technology at the University of Arizona’s BioComputing Facility in Tucson.
Solution: Funded by the NSF, Merchant’s team created the iPlant Collaborative, which offers a more intuitive package of platforms for researchers—initially plant scientists, but now the broader life sciences community—to manage, analyze, and share data.
The iPlant Data Store, a cloud-based platform, provides up to 100 GB of storage space through which researchers can share data. On top of the Data Store, Merchant’s team built Discovery Environment, an analytical platform that packages commonly used sequencing analysis tools into user-friendly apps. A third platform, Atmosphere, is for cloud-based analysis. Connected to the Data Store, Atmosphere gives users multiple configurations of CPUs and memory for computationally intense analysis—and the ability to share. “If you and I are analyzing data together [from different locations] we can see the same screen and share the same mouse regardless of the platform—and all of this is running inside our cloud infrastructure,” he says.
The Data Store, the centerpiece of iPlant, is powered by iRODS, open source “middleware” that helps researchers store, manage, and share their data. Focused on bulk data handling and metadata (data about your data), iRODS has numerous capabilities but requires IT know-how to implement for specific projects. The iPlant Collaborative is one example of how experts have made iRODS more accessible for end users, but iRODS developers are also working to make the tool easier for life scientists.
Getting started: iPlant is scalable. Single labs or small collaborations can import their data into iPlant and use its tools immediately; institutions or large consortia that already have started projects using their own cyberinfrastructure can keep their data local and work with it using iPlant’s resources through the Powered By iPlant project, Merchant adds.
Tip: Check out iPlant’s hands-on workshops, online video introductions for each platform, and written tutorials for apps in its Discovery Environment, such as those designed for analyzing ChIPseq and RNAseq data. Users can create a single login that will work for all platforms, says Merchant.
Costs: Free for all. Users who need larger data storage or processing can request an additional allocation using an online form on iPlant’s site. Requests for particularly large allocations are evaluated by a committee, and so far, no user or group has had to pay, Merchant says.
Shared data analysis and analysis tools
Problem: There’s a lot of guesswork involved in choosing the right computational analysis tools for big data projects—not to mention the headaches associated with implementing new software, which can prove challenging for biologists with no informatics training.
Solution: Computational biologist James Taylor of Emory University in Atlanta, Georgia, and his colleague Anton Nekrutenko at Pennsylvania State University created Galaxy, a platform that now has thousands of data-analysis tools for biomedical research. Most are geared for genomics, though more tools specific for proteomics and imaging analysis are also becoming available on the platform. A community of Galaxy users helps vet the software tools for specific applications.
The ideal user is a bit larger than a single lab, but the platform scales up to the institution level. “[Galaxy] definitely has lots of collaborative features, so if you have a larger group of people using the same Galaxy instance”—that is, a single installation of Galaxy set up on the same server or commercial cloud—“then you get benefits from that,” Taylor says. People using different instances of Galaxy cannot perform shared analyses, but Taylor is working on a way to make that possible. For now, users across different instances can at least share analysis tools through the platform’s Tool Shed.
Getting started: You’ll first need to access Galaxy through the platform’s publicly available server, commercial clouds (ideal if data acquisition is sporadic), or another institution’s public server. Or, with your own server, you can perform your own private installation of Galaxy. You import your data into the platform, and use the web-based interface to deploy individual tools or build a workflow. “The hardest part is bringing your data in in the beginning, where your data is in its most raw form,” Taylor says. “Getting data out in the form of aggregate results or visualizations is not difficult.”
Tip: Take advantage of Galaxy’s online video and written training materials. In addition, free in-person training sessions are available at some conferences; the American Society of Human Genetics 2013 meeting this month (Oct. 22–26) is one example.
Cost: Galaxy is free, but if you want to download it and use it with your own infrastructure—the best option if you’re working with sensitive data, have large computing demands, or want to customize the software—you’ll need to have a server. Costs of cloud computing, through Amazon or other commercial vendors, can also add up, Taylor says.
Large file transfer
Problem: Especially in genomics, data file sizes have become too large to send over Internet hubs using traditional mechanisms such as file transfer protocol. Moving data in and out of cloud storage can be even tougher because it sometimes requires users to break files into smaller chunks, adding to the slowness. Software available from cloud vendors or open source can make it “unusably slow to post large file data to the cloud or download it back out, especially if you’re at any distance from the cloud environment,” says Michelle Munson, CEO of Aspera.
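To make the chunking step concrete, here is a minimal Python sketch of splitting a file into fixed-size pieces and checking that the reassembled copy matches the original. It is a generic illustration, not any cloud vendor's upload API; the function names, the 1,024-byte chunk size, and the stand-in payload are assumptions chosen for the example.

```python
import hashlib
from typing import Iterator, List


def split_into_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive pieces of at most chunk_size bytes."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def reassemble(chunks: List[bytes]) -> bytes:
    """Join the pieces back together in their original order."""
    return b"".join(chunks)


# Stand-in for a sequence file: 4,000 bytes of made-up data.
payload = b"ACGT" * 1000
chunks = list(split_into_chunks(payload, chunk_size=1024))
restored = reassemble(chunks)

# A checksum comparison confirms nothing was lost or reordered.
same = hashlib.sha256(restored).hexdigest() == hashlib.sha256(payload).hexdigest()
print(len(chunks), same)  # 4 True
```

Every chunk is a separate request that has to be sent, acknowledged, and tracked, which is one reason chunked uploads of very large files can feel slow.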
Solution: Aspera’s commercial software relies on technology called “fasp” to replace other file transfer mechanisms. “Our protocol is designed in a radically different way, such that it allows extremely large file sizes to be transferred over long distances,” Munson says. If Internet bandwidth allows, this could mean transport speeds of up to 10 gigabits/second.
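A quick back-of-the-envelope calculation shows what that speed means in practice. The Python sketch below assumes an idealized link with no protocol overhead and a decimal terabyte (10^12 bytes); the 100-megabit/second link is a hypothetical comparison point, not a figure from Aspera.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealized transfer time: total bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


ONE_TERABYTE = 1e12  # bytes, decimal definition

# At the 10 gigabits/second ceiling quoted above.
fast = transfer_seconds(ONE_TERABYTE, 10e9)

# Over an assumed 100 megabits/second connection, for comparison.
slow = transfer_seconds(ONE_TERABYTE, 100e6)

print(f"{fast / 60:.1f} minutes vs. {slow / 3600:.1f} hours")
```

Under these assumptions, one terabyte takes roughly 13 minutes at full speed, compared with about 22 hours on the slower link. Real transfers will be slower, depending on bandwidth and distance.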
Aspera’s Connect Web Browser Plug-in, which installs on your web browser, is free to end users, but it does require a central site to have an Aspera Connect Server. Alternatively, the plug-in is available on major cloud platforms. The server software necessary for those regular, large transfers is available for purchase as a perpetual license or on cloud platforms as a subscription service.
Getting started: For more modest users who transfer data only occasionally, Aspera’s pay-as-you-go service through Amazon Cloud allows users to move as little as 100 gigabytes per month. Those looking to host their own server software can take it for a trial run before they buy. “We spend a lot of time with customers allowing them to evaluate the software and get to know what it can do, and determine how to use it for their own workflow,” Munson says.
Costs: The cost of server software is tiered based on the bandwidth of the connection you’re using it with, and ranges from $4K to $100K. Through cloud vendors such as Amazon, subscription plans are based on the number of bytes transferred in a year, ranging from a penny to $1 for each gigabyte transferred.
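For a rough sense of scale, the per-gigabyte range can be turned into a yearly estimate. The Python sketch below assumes a lab that moves 100 gigabytes a month, or 1,200 gigabytes a year. That volume is an illustrative assumption, and the article does not say which price applies at which volume, so the two figures are only the outer bounds.

```python
def yearly_transfer_cost(gigabytes_per_year: float, dollars_per_gigabyte: float) -> float:
    """Yearly cost at a flat per-gigabyte rate, rounded to cents."""
    return round(gigabytes_per_year * dollars_per_gigabyte, 2)


GIGABYTES_PER_YEAR = 100 * 12  # assumed: 100 GB per month

low = yearly_transfer_cost(GIGABYTES_PER_YEAR, 0.01)   # a penny per gigabyte
high = yearly_transfer_cost(GIGABYTES_PER_YEAR, 1.00)  # $1 per gigabyte

print(f"${low:,.2f} to ${high:,.2f} per year")
```

At that volume, the bounds work out to $12 and $1,200 a year, well below the low end of the server license range.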
» Develop a plan. Who is involved in your project? What data are you collecting? What platforms are you using? What questions are you asking from the data? How would you like to analyze the data set, who is analyzing it, and what are your expected outcomes? From the answers to these questions should come a basic workflow that will help guide the steps of data collection, management, and analysis.
» Don’t build your own platform. Even though it might seem easier at the outset, most scientists would be wise to avoid building their own software tools from scratch, says Stan Ahalt of the Renaissance Computing Institute. “Scientists will benefit in the long run if they invest time in both identifying and learning key data management tools that are available in the open source.”