Huge swaths of modern biomedical science run on high-throughput DNA sequencers. These instruments can pump out data at an astonishing clip, producing gigabytes or more a day. But the end of the sequencing run is not even close to the end of the experiment. Researchers must somehow convert all those As, Cs, Gs, and Ts into knowledge by filtering, assembling, and interpreting the raw data to create a coherent biological picture.
That’s the role of bioinformaticians, and for labs lucky enough to have one on staff, data analysis is just an email request away. Many labs, though, aren’t so lucky. It’s not a lack of tools that is the problem: most popular bioinformatics software is free and open source. But downloading and installing those tools isn’t necessarily easy. Nor, for that matter, is using them. (See “Learning Bioinformatics,” The Scientist, July 2016.)
That’s because sequence-analysis tools largely run on the computer “command line,” invoked not with a mouse and clickable control elements but via lengthy and bewildering textual instructions that specify, say, which reference genome to use or the minimum size of a sequence match. Often these tools depend on other software to function, and are run in series, with the output of one tool being fed into another in a so-called “pipeline.” But as with any method—and make no mistake, says Carole Goble, a professor of computer science at the University of Manchester, U.K., bioinformatics is a method—researchers often need to try many variations and permutations to make their analyses work just right. And they must track precisely which data went into each analysis, the location of the data files that were created in the process, and the tools (and version numbers) used, if they are to properly document and repeat their work.
For all these reasons, many biologists are intimidated by bioinformatics. But there are tools to help, including Galaxy.
According to James Taylor, an associate professor of biology and computer science at Johns Hopkins University and one of the project’s originators, Galaxy provides a point-and-click web interface alternative to the bioinformatics command line, thus allowing researchers to easily create, run, and troubleshoot analytical pipelines.
“The first goal is really to make complex analysis more accessible,” Taylor explains.
And, because Galaxy maintains a detailed record of precisely what analyses each user has run and in what order, the software also fosters reproducibility, making it possible to repeat and share analyses, and/or revisit them at a later date.
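To make the idea of such a record concrete, here is a minimal sketch in Python of the kind of bookkeeping involved. It is not Galaxy's own code or data model; the class names, the tool names ("trimmer", "aligner"), and the parameters are all hypothetical, chosen only to show what logging each step's tool, version, settings, and inputs might look like.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StepRecord:
    """One analysis step: which tool ran, at what version, on what data."""
    tool: str
    version: str
    parameters: dict
    input_digests: list  # SHA-256 digests identifying the exact input data


@dataclass
class History:
    """An ordered log of steps (a hypothetical stand-in for an analysis history)."""
    steps: list = field(default_factory=list)

    def record(self, tool, version, parameters, inputs):
        # Hash each input so the record pins down precisely which data were used.
        digests = [hashlib.sha256(data).hexdigest() for data in inputs]
        self.steps.append(StepRecord(tool, version, dict(parameters), digests))

    def to_json(self):
        # Serialize the full log so it can be shared or archived with a paper.
        return json.dumps([asdict(s) for s in self.steps], indent=2, sort_keys=True)


history = History()
reads = b"@read1\nACGT\n+\nIIII\n"  # toy FASTQ-like content, not real data
history.record("trimmer", "0.1", {"min_length": 20}, [reads])
history.record("aligner", "2.3", {"reference": "toy_genome"}, [reads])
print(history.to_json())
```

Because every step carries a version number and a digest of its inputs, anyone holding the serialized log can check whether a rerun used the same tools and the same data.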
The Scientist asked Taylor and other informaticians how researchers can build their own pipelines. Here’s what they said.
How do I get started?
Perhaps the easiest way to begin is to create an account at usegalaxy.org. This is a free, public, shared Galaxy “instance,” or running copy, that you can use without downloading the software onto your own computer—the equivalent of a public terminal in the library. This instance of Galaxy comes preconfigured with many of the most popular bioinformatics tools. As a shared resource, it is somewhat limited, Taylor admits: “We have to have quotas for the amount of disk space and the number of concurrent analysis jobs someone can have.” And some tools simply are not available at usegalaxy.org. (Some 80+ other publicly shared Galaxy servers, each featuring slightly different tool sets, also are available with a guide to their capabilities at wiki.galaxyproject.org/PublicGalaxyServers.) Those needing more privacy or specialization can run Galaxy on the Amazon cloud, using the Cloud Launch tool (launch.usegalaxy.org/launch). But, given the highly unpredictable cost of cloud computing, Taylor says, a local installation may be the most economical long-term option, assuming users have the IT support necessary to install and maintain the software (wiki.galaxyproject.org/Admin/GetGalaxy).
How do I upload my data?
However it’s run, the main Galaxy interface comprises three panels—a tool menu at left, a history at right, and a tool interface in the center. Users can import data by launching “Get Data” from the tool menu and selecting either a local file or remote resource. “If you have large sequence data sets that are on a server somewhere and accessible, you can fetch them directly from the URL into Galaxy,” Taylor says. Alternatively, Galaxy provides an interface to pull data off several remote servers, such as the UCSC Genome Browser (useful for downloading the coordinates of all human exons, for example), BioMart, and various model organism databases.
How do I find a new tool?
If you know the name of the tool you want to use (for instance, the alignment software Bowtie), just enter it in the tool menu search pane; if it’s there, it’s ready to run as is. Alternatively, assuming users have the necessary authority (that is, they are running a local or cloud-based Galaxy), they can install new tools from the Galaxy Tool Shed (toolshed.g2.bx.psu.edu). With some 3,990 tools currently available, the Tool Shed is a resource for sharing, documenting, and keeping track of different software versions in Galaxy, Taylor says. And that’s important, because it’s not always enough simply to have the most up-to-date version of a given tool; different versions may use slightly different algorithms, or sport unanticipated bugs. “Any tool that goes into the Tool Shed, you can always go back and get that configuration later.”
New tools that aren’t in the Tool Shed can also be imported; simply create an XML configuration file that describes the tool’s parameters, default settings, and instructions for mapping them to graphical elements on the screen. But newbies needn’t worry about how to do that, Taylor says: “Typically it’s the tool developers who actually will end up writing those config files.”
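For the curious, a rough idea of what such a configuration file contains can be sketched with Python's standard XML library. The tool here ("sort_lines"), its command line, and its single input and output are hypothetical. The element names (tool, command, inputs, param, outputs, data, help) follow the general shape of Galaxy tool files, but treat the details as assumptions and check the Galaxy documentation before relying on them.

```python
import xml.etree.ElementTree as ET


def build_tool_config():
    """Build a minimal, hypothetical tool description as an XML element tree."""
    tool = ET.Element("tool", id="sort_lines", name="Sort lines", version="0.1.0")
    ET.SubElement(tool, "description").text = "of a text file"

    # The command template: placeholders are filled in from the parameters below.
    ET.SubElement(tool, "command").text = "sort '$input' -o '$output'"

    # Each parameter becomes a graphical element in the tool's form.
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input", type="data",
                  format="txt", label="File to sort")

    # Each output becomes a new data set in the user's history.
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format="txt")

    ET.SubElement(tool, "help").text = "Sorts the lines of the selected data set."
    return tool


tool = build_tool_config()
ET.indent(tool)  # pretty-print; requires Python 3.9 or later
config_xml = ET.tostring(tool, encoding="unicode")
print(config_xml)
```

The point of the format is the mapping: one block describes the command, another describes the inputs that appear on screen, and a third describes what the tool hands back.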
How can I build a new workflow?
Users can build pipelines directly on the “Workflow Canvas,” a graphical drag-and-drop interface in which tools are configured and interconnected by linking the output of one to the input of another. Say you want to use Trimmomatic to remove adapter sequences from a set of Illumina reads, map them against a reference genome with BWA, and call variants with FreeBayes. Simply select each application from the tool menu at left to place it on the canvas. Configure its behavior using the configuration panel at the right, and insert it into the workflow by creating connections between the appropriate input and output points. Alternatively, you can extract a workflow from the history pane, if you’ve already run through the desired steps once. Just select “Extract Workflow” from the History menu to apply those same steps to a new data set.
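Under the hood, a workflow like the one described above is simply a graph in which each tool waits on the outputs it consumes. The sketch below, in Python, is one way to picture that; it is not how Galaxy stores or runs workflows. The input names ("input_reads", "reference_genome") are hypothetical, and the assumption that both BWA and FreeBayes take the reference genome is made here for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9 or later

# Each key is a step; its value is the set of steps or data sets it depends on.
# Drawing a connection on the canvas corresponds to adding one such dependency.
workflow = {
    "Trimmomatic": {"input_reads"},
    "BWA": {"Trimmomatic", "reference_genome"},
    "FreeBayes": {"BWA", "reference_genome"},
}


def execution_order(graph):
    """Return an order in which every step runs only after its inputs exist."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
print(" -> ".join(order))
```

The exact position of the two raw inputs in the printed order may vary, but trimming always precedes mapping, and mapping always precedes variant calling.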
How can I run a published workflow?
More than 80 Galaxy workflows have been reported in the literature, according to PubMed, for research areas such as genomics (e.g., MGEScan), proteomics (e.g., Galaxy-P), and metabolomics (e.g., Galaxy-M and Workflow4Metabolomics). If the developer has shared that workflow in a public Galaxy instance, users can easily access and launch it via the Shared Data > Workflows menu item. If not, other options are available. Some developers create “virtual machine” implementations or “Docker containers”—preconfigured, ready-to-run software installations that users download and then launch locally or onto a server, for example, in the Amazon cloud (details vary, but published papers generally provide instructions). Others make their tools available via the Galaxy Tool Shed or git repository, as is the case with Galaxy-M. Professor Mark Viant and postdoc Ralf Weber, both of the University of Birmingham, U.K., built Galaxy-M in 2015 as a way to share their custom algorithms with the broader metabolomics community (GigaScience, 5:10, 2016). “We wanted to take metabolomics and make it more approachable and digestible to biologists,” Viant explains, “not just to the analytical chemist or to the computational scientist.”
How can I access external web services?
Suppose a user wanted to build a pipeline to align their sequence reads to a reference genome, identify key sequence variants, and then search external databases, such as Ensembl or PubMed, to triage those results. Galaxy alone cannot do that, says Michael Cornell, a clinical bioinformatics scientist at the University of Manchester, U.K. Its “centralized approach” requires administrators to bring resources, such as databases, in-house into their local Galaxy environment, he says.
Here’s where another pipeline tool, Taverna, can come in handy. According to Goble, Taverna (taverna.incubator.apache.org) is more of a power users’ tool, sporting “first-class support” for web services. (Web services are interfaces that users query remotely in order to run a simulation or render graphics, for instance.) “Researchers often use Galaxy and Taverna in tandem,” she says. Cornell, who teaches clinical scientists the ins and outs of sequence analysis in his clinical bioinformatics lectures at Manchester, has helped students build such workflows. In one example, they identify variants in Galaxy, send them first to a web service at the University of Leiden called Mutalyzer to ensure the variants are named according to proper nomenclature, then on to PubMed to identify key literature citations, and finally return the results to Galaxy for further processing. “The two systems are quite complementary,” he says. (Some Taverna workflows are publicly available at MyExperiment.org.)
How can I document my workflow?
Once a user has created a workflow, they can download it or share it with others. (Select the workflow from the Workflows menu item and select “Share or Download” to obtain a link.) They can also create an annotated “Galaxy Page,” a web document that combines the workflow and plain-text explanations so researchers can document precisely what a workflow does, for instance, as supplementary data to a published paper. “You can actually write up the analysis or whatever and embed workflows and data sets in that description,” Taylor says. (Here’s one example: usegalaxy.org/u/xjasonx/p/hpgv-2.)
Where can I find more information?
There’s no shortage of online documentation for Galaxy (wiki.galaxyproject.org/Learn), including screencasts (vimeo.com/galaxyproject), tutorials (e.g., Galaxy 101: github.com/nekrut/galaxy/wiki/Galaxy101-1), online courses (such as this one at Coursera: www.coursera.org/learn/galaxy-project), lectures available at vimeo.com/album/3456144, and more. There’s also an annual Galaxy Community Conference (held June 25–29 this year at Indiana University in Bloomington), offering both scientific and technical talks as well as two days of in-person training.
Should I consult a bioinformatician anyway?
Short answer: yes. Taverna is not intended for newbies. Galaxy is easier by far, and more user-friendly than the command line, but it can be complicated to use nonetheless. Setting up a local Galaxy instance is particularly challenging, but even firing up a virtual machine on Amazon isn’t trivial. Pipelines and workflows are methods like any other, Goble says, and they too must be planned and debugged and revisited, not tossed into the mix as an afterthought. Her advice: invest in computational expertise, rather than circumventing it. “Recognize that the bioinformatician is a key component of the scientific team.”