Desktop Sequence Analysis Software

Nov 26, 2001
Jeffrey Perkel
Few biological fields have benefited from technological advances as much as genomics. The field could not be where it is today without progress in automated sequencing methods and in software to interpret, annotate, and manage the voluminous data that these automated sequencers churn out. Without this latter development, researchers would be hard pressed to read and understand these gigabytes of data--the equivalent of having an encrypted encyclopedia without a deciphering key. (See related story, Chromosome 22 Provides Human Genome Preview )

Like sequencing technologies themselves, sequence analysis software has evolved substantially from its original incarnation. In the 1980s, before graphical operating systems became standard, these packages primarily consisted of small, command line-based applications. The "GCG® Wisconsin Package™," developed by Genetics Computer Group (GCG) at the University of Wisconsin, was a popular suite of such utilities. Unfortunately, using a variety of separate applications to perform various tasks is tedious, and one program cannot always read the output of another.

But times have changed. Disparate command line-based utilities have, for the most part, been replaced by highly integrated applications with graphical user interfaces. The Wisconsin Package, now offered by San Diego-based Accelrys, currently consists of over 140 separate applications, according to Kevin Kendall, desktop applications product manager at Accelrys. But many users now interact with it through a Web interface, called SeqWeb, which hides the various programs' nuts-and-bolts behind a pleasing facade. A number of companies now supply extraordinarily powerful general-purpose and niche market applications that can ease many routine molecular biology analysis and management tasks. The nature of these programs has changed too, having evolved from simple utilities to applications capable of assisting in experimental planning, troubleshooting, and reagent ordering. In many respects, however, Web-based tools have supplanted standard desktop applications, and the nature of these programs is changing to reflect this trend.

Types of Software

Programmers can create sequence analysis software packages in any style they choose, such as total solutions that perform a variety of analyses, or niche programs intended to solve a particular problem. They can also be Web-based or installed on a local machine or server. Total solution packages come in two basic varieties: modular or "all-in-one" applications. Modular packages, such as Bethesda, Md.-based Informax Inc.'s Vector NTI Suite, Durham, N.C.-based Scientific & Educational Software (Sci-Ed)'s Clone Manager Suite, and Madison, Wis.-based DNASTAR's Lasergene, consist of a basic "container" program to which specific analysis functions are added by installing additional modules. In contrast, all-in-one applications such as Toronto-based Redasoft's Visual Cloning 2000 and Accelrys' OMIGA and MacVector contain all features in a single, integrated program.

Modular implementations offer a number of benefits. First, users can add new features, and upgrade existing ones, simply by obtaining new modules. In addition, users of modular software can save money by purchasing only those modules that they need. Finally, some companies, such as Hamilton, New Zealand-based Genamics and Alameda, Calif.-based MiraiBio, provide open application programming interfaces (APIs) so users can create custom add-ins to their software.

Niche software packages, such as Ann Arbor, Mich.-based Gene Codes' Sequencher, focus on single operations. Although these programs may be able to perform multiple analyses to accommodate users who do not wish to purchase several applications, the software is really intended to be exceptionally efficient at a specific task. Sequencher, for example, is primarily a sequence assembly and multiple alignment package, but it can also perform routine tasks such as restriction enzyme analysis. The program is not currently capable of performing genome-scale contig assembly; users requiring this capability can use the PHRED/PHRAP package (developed by the University of Washington's Phil Green), which was used to assemble the human genome sequence.1

Companies like Oakland, Calif.-based DoubleTwist and Sunnyvale, Calif.-based Entigen, purveyor of Bionavigator.com, offer online sequence analysis applications that give academic users and companies access to bioinformatics muscle they might not otherwise be able to afford. DoubleTwist primarily offers fee-based access to its proprietary annotated databases. However, users can also perform standard sequence analyses using the company's GeneTools and PepTools application suites, which users install on local machines. BioNavigator, in contrast, offers a Web portal to a variety of "best-of-breed" sequence analysis applications, such as Accelrys' Wisconsin Package, according to Entigen's president of informatics applications, Howard Goldstein. One advantage of BioNavigator's approach is that the system automatically ensures that data can be freely moved between disparate applications, says Goldstein.

Features

Modern sequence analysis packages carry out a whole host of functions that are designed to ease researchers' day-to-day work. Some of these functions are described briefly below.

Sequence Entry and Editing: Fundamentally, sequence analysis applications manipulate and manage long strings of characters that represent nucleotide or amino acid residues. To get this information in a form that the program can use, researchers must input sequence data either manually or from a file. These programs can usually recognize such file formats as plain text, the GenBank, FASTA, and PIR (Protein Information Resource) formats, as well as the sequence chromatogram files that automated DNA sequencers generate. Many programs also allow direct retrieval of sequence files from any of a number of Web-accessible databases, such as the National Center for Biotechnology Information (NCBI)'s GenBank database.
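
As a rough illustration of what reading one of these formats involves, the following Python sketch parses FASTA-formatted text into a dictionary of sequences. The function name parse_fasta and the example records are hypothetical, and the code is not taken from any of the packages discussed here; it assumes the common convention that a header line starts with ">" and that its first word is the record identifier.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID,
            # falling back to a generated name if the header is empty.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record, normalized to upper case.
            records[current_id] += line.upper()
    return records


# Hypothetical input: one nucleotide record split over two lines, one peptide record.
example = """>seq1 hypothetical example record
ATGGCCATTG
taatgggcc
>seq2
MKTAYIAK
""".splitlines()

records = parse_fasta(example)
```

A real package would also have to handle GenBank and PIR records and chromatogram files, which carry far more structure than this.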

Sequence Annotation: Sequence annotation is the process of identifying and demarcating regions of interest, whether they are DNA elements like promoters, coding sequences, and polyadenylation signals, or important protein domains. Annotation of raw data is perhaps the most labor-intensive aspect of creating a sequence data file for submission to a major database, as most data files from major sequence and structure databases contain exhaustive feature tables that describe every known feature of the sequence. Researchers can read these tables directly, but they are best interpreted by software packages that can visually display each element.

Restriction Enzyme Analysis: Most programs allow the user to locate restriction enzyme sites on the DNA sequence and filter the results to display unique cutters, noncutting enzymes, or digestion sites located within a defined region. These programs usually allow researchers to define custom enzyme lists and can also provide information about specific enzymes, such as whether or not a given enzyme has known isoschizomers. Some software allows researchers to keep track of which enzymes are in stock and who supplies them, and even provides a way to order out-of-stock items online.
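
The site search and the unique-cutter and noncutter filters can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the three-enzyme table, the function names, and the test sequence are assumptions made for the example, the sequence is treated as linear, and positions are 0-based starts of the recognition site rather than the exact cut position.

```python
import re
from typing import Dict, List, Tuple

# A tiny illustrative enzyme list; real packages ship much larger tables.
# All three recognition sites are palindromic, so one strand is enough.
ENZYMES: Dict[str, str] = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(seq: str, enzymes: Dict[str, str] = ENZYMES) -> Dict[str, List[int]]:
    """Map each enzyme name to the 0-based start positions of its sites."""
    seq = seq.upper()
    hits: Dict[str, List[int]] = {}
    for name, site in enzymes.items():
        # A lookahead pattern reports overlapping occurrences as well.
        hits[name] = [m.start() for m in re.finditer(f"(?={site})", seq)]
    return hits


def classify(hits: Dict[str, List[int]]) -> Tuple[List[str], List[str]]:
    """Return (unique cutters, noncutters) from a find_sites() result."""
    unique = sorted(name for name, pos in hits.items() if len(pos) == 1)
    noncutters = sorted(name for name, pos in hits.items() if not pos)
    return unique, noncutters


# Hypothetical test sequence: one EcoRI site, two BamHI sites, no HindIII site.
hits = find_sites("TTGAATTCAAGGATCCTTGGATCCAA")
unique_cutters, noncutters = classify(hits)
```

Restricting the search to a defined region would amount to slicing the sequence before searching and adding the slice offset back to each position.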

ORF Finding and Translation: Sequence analysis packages can usually search a sequence for open reading frames (ORFs) that meet a set of user-defined criteria--for instance, whether they must begin with a methionine residue, which frames to check, and how long the ORF should be. Once the desired ORF is selected, the program can translate it using the appropriate genetic code table to create the input for a wide variety of protein analysis tools.
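
A simplified version of such a search might look like the Python sketch below. It assumes the standard genetic code, scans only the forward strand, and exposes the three criteria mentioned above as parameters; the names find_orfs, require_atg, and min_codons are invented for the illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna: str) -> str:
    """Translate a coding sequence; unrecognized codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_orfs(seq: str, min_codons: int = 3, require_atg: bool = True,
              frames: Sequence[int] = (0, 1, 2)) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for forward-strand ORFs that end in a stop codon.

    start and end are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in frames:
        start = None if require_atg else frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first methionine codon opens the ORF
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(seq[start:i])))
                # After a stop, wait for the next ATG (or restart immediately).
                start = None if require_atg else i + 3
    return orfs


# Hypothetical sequence containing one short ORF (ATG GCT AAA TGA) in frame 2.
orfs = find_orfs("GGATGGCTAAATGACC")
```

Searching the reverse strand would mean running the same routine on the reverse complement and converting the coordinates back.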

Primer Design: Primer designers have become standard features for most sequence analysis packages, though several companies, such as Sci-Ed Software and DNASTAR, include the feature as an add-on module rather than as a base feature. Primer designers can find PCR primers, sequencing primers, and hybridization probes, and can usually identify oligos based on user specifications that include the region to amplify, minimum length, GC content, and melting temperature. These calculators often alert the researcher if a particular primer pair can form dimers, or if a given primer will form hairpins, both of which can decrease efficiency. Once a pair of primers is selected, users can often perform a mock PCR reaction to create a new molecule, and some programs allow users to order their new oligos online.
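
The kinds of checks a primer designer applies can be illustrated with a short Python sketch. It is a deliberately crude approximation: the melting temperature uses the Wallace rule of thumb (2 degrees C per A or T, 4 degrees C per G or C), which is only a rough guide for short oligos, and the dimer test simply asks whether the last few 3' bases of one primer could anneal somewhere on the other. The function names and the four-base window are assumptions for the example; commercial packages use more sophisticated thermodynamic models.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C (Wallace rule of thumb)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def may_form_dimer(primer_a: str, primer_b: str, window: int = 4) -> bool:
    """True if the 3' end of primer_a is complementary to any stretch of primer_b."""
    three_prime = primer_a.upper()[-window:]
    return reverse_complement(three_prime) in primer_b.upper()


# Hypothetical 20-mer with balanced base composition.
primer = "ATGCATGCATGCATGCATGC"
gc = gc_content(primer)
tm = wallace_tm(primer)
```

Calling may_form_dimer with the same primer as both arguments gives a simple self-dimer check.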

Protein Analyses: There are literally hundreds of analyses that can be performed on peptide sequences, such as secondary structure prediction, motif domain identification, and other sequence-dependent analyses. Many software titles offer some set of protein analysis tools, either embedded within the program, or more commonly, run through remote servers. For instance, Genamics' Expression application offers users nine different protein secondary structure prediction algorithms, all of them available through the Network Protein Sequence Analysis server in France, whereas Visual Cloning 2000 passes queries through the Swiss Institute of Bioinformatics ExPASy server.

Creating Plasmid Maps: The creation of plasmid maps is a common hurdle, as researchers must produce these maps for both internal and external purposes. Creating a complicated circular map using a standard drawing application is extremely difficult, and the production of even simple maps requires the overlaying of many arcs, arrows, and circles of varying opacity. Once a map is generated, incorporating it into other documents or images is problematic, and any subsequent changes or customizations are nearly impossible to make. Most modern packages can now produce publication-quality, highly customizable plasmid maps in a variety of vector and bitmap formats. Some programs also allow the user to make composite images that document, for example, the process by which a complicated plasmid was created.

Sequence Alignments and Similarity Matching: Analysis of a sequence's similarity to one or several other sequences can help researchers identify important motifs or define gene function based on homology to another gene. Broadly, similarity-matching algorithms search for either global or local alignments in two or more sequences. Global alignment algorithms, such as Needleman-Wunsch, compare two sequences in their entirety. In contrast, local algorithms like Smith-Waterman find smaller regions of similarity, such as homologous domains in proteins that otherwise share no significant similarity. One popular multiple sequence alignment algorithm is ClustalW.2 Many sequence analysis packages incorporate the ClustalW algorithm locally, whereas others farm the analysis out to remote servers.
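
The difference between global and local alignment is easiest to see in the scoring recurrences. The Python sketch below computes only the optimal score, not the alignment itself, for Needleman-Wunsch and Smith-Waterman under a simple scheme chosen for illustration (match +1, mismatch -1, linear gap penalty -2). Real tools use substitution matrices and affine gap penalties, so the numbers here are not comparable to theirs.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned against gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a aligned against gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere, which is
            # what makes the method local rather than global.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two hypothetical sequences that share the block GATTACA but little else.
seq_a = "AAAAGATTACA"
seq_b = "GATTACACCCC"
local = sw_score(seq_a, seq_b)
global_ = nw_score(seq_a, seq_b)
```

For this pair the local score reflects the shared seven-base block, while the global score is dragged down by the unrelated flanking residues.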

Although these various algorithms work well if the user knows which sequences to compare a query sequence against, they cannot be used if the identity of the query sequence is not known. To identify such sequences, researchers can use NCBI's Basic Local Alignment Search Tool (BLAST), a similarity search application. There are several BLAST programs for use with either protein or nucleic acid sequences. For example, BLASTN queries nucleic acid databases with a nucleic acid sequence, whereas BLASTP interrogates a peptide database using a peptide sequence.

Most sequence analysis programs allow researchers to query the GenBank database (or other remote or local databases) for matches to a given sequence, but how they perform that search varies. Some companies require users to interact with the NCBI Web site through a browser, but vendors can also write programs using NCBI's URL API to query the NCBI database from within the program, without a Web browser. Vector NTI and Accelrys' OMIGA, for instance, both offer this feature.
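
To give a sense of what querying through a URL API involves, the Python sketch below only builds the request URLs; it does not contact NCBI. The endpoint address is the present-day one and is an assumption as far as this article is concerned, as are the helper names. The parameters shown (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) follow NCBI's documented submit-then-retrieve pattern, but anyone building on this should check the current documentation and usage guidelines.

```python
from urllib.parse import urlencode

# Assumed endpoint: the current NCBI BLAST URL API address, not one given in the article.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_url(query: str, program: str = "blastn", database: str = "nt") -> str:
    """URL that submits a search; the response carries a request ID (RID)."""
    params = {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}
    return f"{BLAST_URL}?{urlencode(params)}"


def build_result_url(rid: str, format_type: str = "Text") -> str:
    """URL that retrieves the results for a previously issued RID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return f"{BLAST_URL}?{urlencode(params)}"


# Hypothetical query sequence and placeholder request ID.
submit_url = build_submit_url("ATGGCTAAATGA")
result_url = build_result_url("HYPOTHETICALRID")
```

A program embedding BLAST would send the first request, read the RID from the response, and poll the second URL until the results are ready.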

What the user can do with BLAST results also varies between programs. Some allow researchers to directly import matching sequences into the program, whereas others simply display the matches. In the latter case, the user must save the file to their local machine and import it, or copy-and-paste the sequence into the program. The copy-and-paste approach results in the loss of any sequence annotation in the GenBank file.

Cloning in silico: The creation of a new construct file can usually be accomplished in two ways. The first method requires the user to determine precisely which bases the enzyme digest will liberate, copy that fragment to the clipboard, perform a similar operation on the target vector, and produce a new recombinant vector sequence file using standard Edit menu cut-and-paste functions. Unfortunately, this type of operation is both tedious and error-prone, especially if the ends must be modified in some way. Furthermore, although some applications will continue to associate sequence features such as coding sequences or promoters with the sequence itself, not all do, meaning that these features must be defined anew in the new recombinant plasmid.

To ease these drawbacks, several sequence analysis packages enable researchers to perform cloning reactions 'in silico,' using the molecular biologist's operational metaphors, rather than those of the software designer. Instead of cutting and pasting the appropriate sequences, researchers perform restriction digests on a given sequence, isolate specific fragments, and ligate them to produce the final product. These programs usually allow users to specify a variety of enzymatic treatments to be performed on a fragment if necessary, such as blunting with Klenow polymerase. Beyond simplifying the creation of a new sequence file, these cloning systems also alert the user if ends are incompatible, and some will even produce experimental protocols to synthesize the vector. As Douglas Ward, partner at Sci-Ed Software, notes, molecular biologists perform these enzymatic reactions every day and are more comfortable with them than with cut-and-paste style operations. The in silico cloning functionality of programs like Sci-Ed's Clone Manager and Informax's Vector NTI enables these researchers to work with whichever metaphor they are most at ease with.
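
The digest, isolate, and ligate metaphor can be mimicked in a toy Python model. The sketch below is an assumption-laden simplification and not how Clone Manager or Vector NTI work internally: it tracks only the top strand of linear molecules, covers three enzymes that leave 5' overhangs, and treats two ends as compatible when their overhang sequences are identical. All names and sequences are hypothetical.

```python
from typing import Dict, List, Tuple

# name -> (recognition site, cut offset on the top strand)
ENZYMES: Dict[str, Tuple[str, int]] = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "BglII": ("AGATCT", 1),  # A^GATCT
}


def overhang(enzyme: str) -> str:
    """Single-stranded overhang left by a palindromic 5'-overhang cutter."""
    site, cut = ENZYMES[enzyme]
    return site[cut:len(site) - cut]


def compatible(enzyme_a: str, enzyme_b: str) -> bool:
    """Ends can anneal if both enzymes leave the same overhang."""
    return overhang(enzyme_a) == overhang(enzyme_b)


def digest(seq: str, enzyme: str) -> List[str]:
    """Cut a linear sequence at every site; return top-strand fragments in order."""
    seq = seq.upper()
    site, cut = ENZYMES[enzyme]
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut])
        start = pos + cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


def ligate(left: str, left_enzyme: str, right: str, right_enzyme: str) -> str:
    """Join two fragments, refusing if the ends being joined are incompatible."""
    if not compatible(left_enzyme, right_enzyme):
        raise ValueError(f"{left_enzyme} and {right_enzyme} ends are incompatible")
    return left + right


# Hypothetical vector with one EcoRI site and a donor with an EcoRI-flanked insert.
vector_left, vector_right = digest("CCCGAATTCGGG", "EcoRI")
_, insert, _ = digest("TTGAATTCATGAAAGAATTCTT", "EcoRI")
recombinant = ligate(ligate(vector_left, "EcoRI", insert, "EcoRI"),
                     "EcoRI", vector_right, "EcoRI")
```

In this model a BamHI end and a BglII end are reported as compatible because both leave a GATC overhang, whereas joining an EcoRI end to a BamHI end raises an error, which is the kind of warning the packages give.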

For those researchers making vectors via techniques other than standard restriction enzyme digestion and ligation, Palo Alto, Calif.-based PREMIER Biosoft offers SimVector. Like its competitors, this package accommodates restriction enzyme-based cloning but also enables TA-based and Cre recombinase-based strategies. According to Arun Apte, SimVector product manager, the upcoming version 2.0 of the program will further enable researchers to construct vectors using Carlsbad, Calif.-based Invitrogen's Gateway™ cloning system, which is like Cre, except that it enables directional cloning strategies.

Gel Prediction: To ease the actual experimental construction of new plasmid constructs, a few sequence analysis programs, including Informax's Vector NTI and West Lebanon, N.H.-based Textco's Gene Construction Kit, offer a gel prediction function. This feature allows the user to specify DNA samples, run conditions, and the DNA marker used, thereby enabling the user to test electrophoresis conditions without actually wasting time or precious samples. Gel prediction can be a useful check when trying to determine whether the isolation of a specific fragment from a collection of fragments is even possible, eliminating one common source of lab frustration.
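
One common approximation behind this kind of prediction is that, over a limited size range, the distance a linear DNA fragment migrates falls off roughly linearly with the logarithm of its length. The Python sketch below applies that approximation with made-up calibration constants to ask whether two fragments would be separable. The constants, the 1 mm threshold, and the function names are all hypothetical, and nothing here reflects how Vector NTI or Gene Construction Kit actually model a gel.

```python
import math

# Hypothetical calibration for one gel and one set of run conditions:
# distance_mm = INTERCEPT_MM - SLOPE_MM * log10(size_bp)
SLOPE_MM = 25.0
INTERCEPT_MM = 110.0


def migration_mm(size_bp: int) -> float:
    """Predicted migration distance under the semi-log approximation."""
    if size_bp <= 0:
        raise ValueError("fragment size must be positive")
    return INTERCEPT_MM - SLOPE_MM * math.log10(size_bp)


def resolvable(size_a: int, size_b: int, min_gap_mm: float = 1.0) -> bool:
    """True if the two predicted bands lie at least min_gap_mm apart."""
    return abs(migration_mm(size_a) - migration_mm(size_b)) >= min_gap_mm


# Hypothetical fragment pairs: one easily separated, one nearly co-migrating.
easy = resolvable(500, 3000)
hard = resolvable(3000, 3050)
```

In practice the constants would be fitted from the marker lane for each gel, since agarose concentration, voltage, and run time all shift the curve.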

Security Concerns

In general, academic users don't need to concern themselves with security issues more commonly associated with industrial clients. Therefore these users can run BLAST searches remotely against the NCBI GenBank database and make computations on amino acid sequences using the proteomic tools found on the ExPASy server. However, users or commercial entities working with proprietary sequences or databases do not always have this luxury. Fortunately, most bioinformatics applications and databases can be installed locally, behind a corporate or institutional firewall, allowing data analyses to be conducted in total security.

According to Nick Tsinoremas, DoubleTwist's vice president of Genomics, the purchase of local copies of databases is not limited to industrial customers. Academic clients intent on centralizing bioinformatics resources have begun purchasing local copies of large bioinformatics resources for their universities as well. He notes that some large academic institutions have purchased local copies of the DoubleTwist databases.

Laboratories that are ready to venture into the sequence analysis software market will find options for most budgets. Free options are also available for money-conscious researchers, occasional users, and educators, including San Francisco-based LabVelocity's Jellyfish (www.biowire.com), the San Diego Supercomputer Center's Biology Workbench (workbench.sdsc.edu), and the Sequence Manipulation Suite (www.bioinformatics.org/sms/index.html).

Because sequence analysis software's feature sets and ease-of-use vary widely, the best way to make a buying decision is to scrutinize the free software demos or tutorials that most companies place on their Web sites. Informax will even interactively demonstrate the program for potential users online over a Web link, providing the opportunity to ask questions and clarify issues directly.

Despite remarkable advances, the field of sequence analysis software continues its long transition. Scientists can now perform many of the tasks previously carried out by desktop software over the Web, says PREMIER Biosoft's Arun Apte. These Web-based applications can be updated more rapidly than traditional software, and tend to be more powerful. Furthermore, Apte says that the techniques the traditional software is based on are largely obsolete. For example, protein biochemistry has changed dramatically in the last decade or so, and standard biochemical analyses have been supplanted by mass spectrometry. "So," he concludes, "I basically think that the traditional sequence analysis packages are dead."

But Sci-Ed's Douglas Ward cautions that it is not clear how much longer the exponential explosion in Web interactivity will continue in the desktop software market. He notes, for example, that certain database searches can be quite slow owing to increased network traffic. That, coupled with an increasing number of attacks on Web sites, may force companies to offer both local and Web-based versions of popular tools.

Nevertheless, change is coming. According to Entigen's Howard Goldstein, an attendee at the 2001 Genome Sequencing and Analysis Conference in San Diego, the current buzz in the field is integration: "The questions that are now being asked are multidimensional. 'What genes influence what diseases?' 'What metabolic pathways are influenced by what family history?' ... You now need to take pharmacological data and toxicology data and clinical data, and combine it with genomic data. Even if you just think you're doing sequence analysis, that's not good enough anymore." In other words, he says, sequence analysis is a beginning, not an end, as researchers begin to contemplate complex, systems biology questions.

Jeffrey M. Perkel can be contacted at jperkel@the-scientist.com.

References
1. International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature, 409:860-921, Feb. 15, 2001.

2. J.D. Thompson et al., "CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, 22:4673-80, 1994.


