Biology by the Numbers

Jun 20, 2005
Jeffrey Perkel (jperkel@the-scientist.com)

When the graduate students and postdocs in Martin Wilson's lab at the University of California, Davis, need to do image processing, they look to an unlikely source. Instead of off-the-shelf image analysis software, Wilson's team, which studies synapses in the retina, uses home-brewed algorithms running within the mathematical package Matlab, from The Mathworks, Inc., in Natick, Mass.

"Whenever we get images, we apply these routines that we've written to the images to massage the data, improve the signal-to-noise ratio, to do the things to get quantitative images that can tell us something interesting," says Wilson, a professor of neurobiology, physiology, and behavior.

They can do that because Matlab isn't so much a program as it is a programming language, like Perl or C. The advantage of using such languages, says Kristen Amuzzini, biotech and pharmaceutical marketing manager at The Mathworks, is that programmers need not "reinvent the wheel" to implement algorithms that have already been developed. Take k-means clustering, for instance, an algorithm used to cluster microarray data. "In Matlab that's one line of code," says Amuzzini. "In other languages it might be 30 or 50 lines of code."
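To give a sense of what those extra lines involve, here is a minimal sketch of k-means written from scratch in Python. It is an illustration only, not code from The Mathworks or from any lab mentioned here; the function name, the random starting centroids, and the tiny "expression profile" data set are all assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (lists of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles: two genes low across conditions, two high.
profiles = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [5.0, 5.2, 4.9], [5.1, 4.8, 5.3]]
labels, centroids = kmeans(profiles, k=2)
```

Even this stripped-down version, which skips convergence checks and smarter initialization, runs to about two dozen lines. A package that ships the algorithm reduces the same job to a single function call.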

Matlab is just one of a number of math-centric languages, both open source and commercial, that researchers can use to write customized, mathematically intensive applications for their labs: everything from filtering noise from experimental time series data to running systems biology models. Though intended for a broader audience than life scientists, these packages nevertheless are powerful applications and languages with a significant presence in the academic, biotech, and pharmaceutical communities.

TO THE LETTER: R AND S

If http://www.bioinformatics.org is any indication, most biologists in need of advanced mathematics take the no-cost route. The site recently asked its readers, "Which math/statistics language/application do you most frequently use?" A plurality (24%) of the survey's 1,675 or so respondents chose R, an open-source statistical language and environment written more than a dozen years ago by Ross Ihaka of the University of Auckland, New Zealand, and Robert Gentleman, currently at the Fred Hutchinson Cancer Research Center in Seattle (http://www.r-project.org).

R is an implementation of S, a statistical computing language developed at AT&T Bell Labs. "It's basically a programming language with a fairly rich set of tools for doing statistical and other models, fairly high quality graphics," says Gentleman. The language and environment run in Windows, MacOS, and Linux; version 2.1.0 was released this past April.

The language's utility in bioinformatics is clear: Three years ago Gentleman, who calls himself a computational biologist rather than a bioinformatician ("I don't know what such a beast is," he quips), used it to organize a collaborative project called Bioconductor http://www.bioconductor.org. An open-source, extensible bioinformatics toolset, Bioconductor features nearly 100 separate modules in its most recent version. Though Gentleman and his team coordinate the project, the bulk of the software has been written and contributed by others; packages exist to process microarray, mass spectrometric, and array comparative genomic hybridization data; deal with databases; and handle graphics, among other features.

Reaction to the project has been positive, says Gentleman, "certainly more positive than I imagined." Gentleman doesn't track downloads, but Bioconductor's mailing list does sport some 1,200 addresses.

Insightful Corp. of Seattle bought the rights to S software from Lucent Technologies in 2004 and now offers a commercial version of the language called S-Plus. Though S-Plus is largely compatible with R, differences do exist. But the original developers of S and R, among other languages, are collaborating on new work that will help researchers get the most out of existing statistical software, says John Chambers, who led the original design of S at Bell Labs Research in Murray Hill, NJ, and is now a member emeritus there. The Omegahat Project http://www.omegahat.org is an example: "One of the things that Omegahat does is to provide interfaces between the different systems, R and Matlab, R and S, and so on," he says.

MATLAB

Matlab garnered second place in the bioinformatics.org survey with 20%. Currently at version 7.0.4 and available for Windows, MacOS, Linux, and Unix, Matlab costs $1,900 for a single-user commercial license.

Matlab is both a development environment and a language. For simple mathematical tasks, a user can execute instructions in the program's command window, much as one would execute a dir (directory) command at a DOS prompt. For more complicated work, Matlab has a programming editor, in which full-fledged applications may be built and tested. Those programs may then be run from within the environment, or compiled as stand-alone executable files or Microsoft Excel plug-ins (though this capability requires additional software from The Mathworks).

Selected Suppliers

Aptech Systems (Gauss) http://www.aptech.com

Insightful Corp. (S-Plus) http://www.insightful.com

Maplesoft (Maple) http://www.maplesoft.com

Mathsoft (MathCAD) http://www.mathsoft.com

The Mathworks Inc. (Matlab) http://www.mathworks.com

Wolfram Research (Mathematica) http://www.wolfram.com

Complementing Matlab's intrinsic feature set is a collection of toolboxes, bundled groups of 50 to 100 functions that enhance Matlab's existing functionality in a particular area. Popular in the biological space, says Amuzzini, are toolboxes dedicated to statistics, distributed computing, and bioinformatics. The Bioinformatics Toolbox, currently at version 2.1, costs $1,000 for a single-user commercial license.

Most of the source code for the toolboxes, like Matlab's internal functions, is available to users, allowing them both to see how algorithms are implemented and to customize them. Wilson's UC Davis team uses and extends the Image Processing Toolbox, an add-on package that allows them to, among other things, load image files, detect edges, and count pixels.
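The operations named here are simple to state in code. The sketch below, in plain Python, counts pixels above a brightness threshold and flags sharp horizontal jumps in brightness as edges. It is a hypothetical illustration: the function names and the tiny grayscale "image" are invented for the example, and it reproduces neither the Image Processing Toolbox's interface nor the Wilson lab's routines.

```python
def count_above(image, threshold):
    """Count pixels brighter than threshold in a grayscale image (list of rows)."""
    return sum(1 for row in image for px in row if px > threshold)


def horizontal_edges(image, min_step):
    """Return (row, col) positions where brightness jumps between neighbors."""
    edges = []
    for r, row in enumerate(image):
        for c in range(len(row) - 1):
            # An "edge" here is any step of at least min_step to the next pixel.
            if abs(row[c + 1] - row[c]) >= min_step:
                edges.append((r, c))
    return edges


# Hypothetical 3 x 4 image: a bright patch in the upper right corner.
image = [
    [10, 12, 200, 210],
    [11, 13, 205, 208],
    [9, 10, 11, 12],
]
bright_pixels = count_above(image, 100)   # 4 pixels belong to the patch
edges = horizontal_edges(image, 50)       # [(0, 1), (1, 1)]
```

Real edge detectors are considerably more careful about noise and direction, which is part of what a toolbox's prewritten functions provide.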

Unsatisfied with Matlab's Bioinformatics Toolbox, James Cai, a PhD student at the University of Hong Kong, developed a custom package called MBEToolbox.[1] "I regard MBEToolbox as an algorithm prototyping environment for molecular evolutionists," Cai writes in an E-mail. "But looking back, I started MBEToolbox programming as a self-learning process." Other newly published Matlab toolboxes help users process microarray data and combine functional imaging data with structural maps of the brain, among other things.[2,3]

For those put off by Matlab's cost and willing to tough it out with a less polished application, several open source Matlab clones exist. These include Octave (http://www.octave.org; 2% of the survey votes), Scilab http://www.scilab.org, and Rlab http://rlab.sourceforge.net. Though none is 100% compatible with Matlab (that is, Matlab source files may or may not work correctly), each provides a subset of its features.

MATHEMATICA

Five percent of those who responded to the bioinformatics.org poll develop mathematical applications in Mathematica, from Wolfram Research of Champaign, Ill. (Wolfram Research recently entered into a partnership with BioMed Central, a sister company of The Scientist, to develop and sell software that eases the preparation of open-access articles.) Mohammed AlQuraishi, lead developer in Wolfram's computational biology group, says Mathematica incorporates three elements: a core mathematics engine, a programming language, and a notebook front-end that allows users to author scientific and mathematical documents, much as they write text documents in Word.

Mathematica's strength is in symbolic math – that is, it is capable of solving algebraic equations to give another equation. Languages like Matlab and Gauss, in contrast, excel in numeric math, solving equations with data to provide an answer.
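A quadratic equation makes the distinction concrete (an illustrative example chosen here, not one drawn from either vendor). A symbolic system solves the general form and returns another expression; a numeric system takes specific coefficients, here a = 1, b = -3, and c = 2, and returns values:

```latex
\begin{align*}
\text{symbolic:}\quad & ax^2 + bx + c = 0 \;\Longrightarrow\; x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \\
\text{numeric:}\quad & x^2 - 3x + 2 = 0 \;\Longrightarrow\; x = 1 \ \text{or}\ x = 2
\end{align*}
```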

This prowess with symbolic math has made Mathematica (current version 5.1.1) a potent tool for systems biologists. Eric Mjolsness, associate professor of computer science at UC-Irvine, in collaboration with Bruce Shapiro of NASA's Jet Propulsion Laboratory, used the package to develop Cellerator http://www.cellerator.info, for instance. Given a biological pathway, Cellerator produces a series of differential equations that can be solved in Mathematica or exported to another format such as C, C++, or SBML (Systems Biology Markup Language).
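The general idea of turning a pathway into differential equations can be sketched in a few lines of Python. This is not Cellerator's method or output. It is a minimal numeric illustration that assumes mass-action kinetics and uses the crude forward Euler method; the function names, the enzyme-substrate pathway, and the rate constants are all hypothetical.

```python
def mass_action_rhs(reactions):
    """Turn a reaction list into a function computing d(concentration)/dt.

    Each reaction is (reactants, products, rate_constant), with species
    given by name. Rates follow the law of mass action.
    """
    def rhs(conc):
        deriv = {name: 0.0 for name in conc}
        for reactants, products, k in reactions:
            # Reaction rate = k times the product of reactant concentrations.
            rate = k
            for name in reactants:
                rate *= conc[name]
            for name in reactants:
                deriv[name] -= rate
            for name in products:
                deriv[name] += rate
        return deriv
    return rhs


def euler(rhs, conc, dt, steps):
    """Integrate with the forward Euler method (simple, not very accurate)."""
    conc = dict(conc)  # work on a copy so the caller's values are untouched
    for _ in range(steps):
        d = rhs(conc)
        for name in conc:
            conc[name] += dt * d[name]
    return conc


# Hypothetical pathway: enzyme E binds substrate S reversibly,
# and the complex ES then releases product P.
pathway = [
    (["E", "S"], ["ES"], 1.0),
    (["ES"], ["E", "S"], 0.5),
    (["ES"], ["E", "P"], 0.3),
]
start = {"E": 1.0, "S": 2.0, "ES": 0.0, "P": 0.0}
final = euler(mass_action_rhs(pathway), start, dt=0.01, steps=1000)
```

A useful sanity check on any such model is conservation: total enzyme (E plus ES) and total substrate (S plus ES plus P) should stay constant throughout the run.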

Christian Jacob, associate professor of computer science and biochemistry and molecular biology at the University of Calgary, has used Mathematica to simulate macromolecular and cellular interactions during immune reactions and at the lactose operon – though he now writes his own code in C++, because Mathematica had trouble handling the size of his models.

Unlike Matlab programs, Mathematica notebooks cannot be compiled into stand-alone applications, meaning anyone wishing to run an algorithm written in Mathematica must own the program. The professional edition costs $1,880 for a single-user license in Windows, MacOS, or Linux; other Unix implementations cost $3,135.

GAUSS

Gauss, from Aptech Systems in Seattle, is another mathematical language. "In a nutshell," says company president Sam Jones, "Gauss is a workhorse, number-crunching language with an easy-to-use syntax. You're not required to write thousands of lines of code to solve one problem."

For instance, Gauss (like Matlab and S-Plus) defines a matrix data type. That is, it allows programmers to treat matrices the same as they treat simple data types like integers. So it can multiply two matrices, a and b (holding, for example, nucleotide substitution probabilities), with the simple expression a*b. To perform such an operation in C or FORTRAN, Jones explains, would require the programmer to write three nested loops of code, keep track of the matrix size, and watch the program's memory allocation. "So having a matrix data type is pretty fundamental," he concludes.
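The contrast Jones describes can be sketched in Python, used here as a neutral stand-in rather than Gauss itself. The first function spells out the three nested loops; the small Matrix class, a hypothetical type written for this example, then hides them behind the * operator. The two-state substitution probabilities are invented for illustration.

```python
def matmul_loops(a, b):
    """Multiply two matrices stored as lists of rows, the explicit way."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):            # loop 1: rows of a
        for j in range(cols):        # loop 2: columns of b
            for k in range(inner):   # loop 3: sum over the shared dimension
                result[i][j] += a[i][k] * b[k][j]
    return result


class Matrix:
    """Tiny matrix type so that a * b means matrix multiplication."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __mul__(self, other):
        return Matrix(matmul_loops(self.rows, other.rows))


# Hypothetical two-state substitution probabilities for one time step.
p = Matrix([[0.9, 0.1], [0.2, 0.8]])
two_steps = p * p  # rows are approximately [0.83, 0.17] and [0.34, 0.66]
```

With the type defined once, every later multiplication is a single expression, and the bookkeeping about sizes lives in one place.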

What distinguishes Gauss from its competitors, according to Jones, is speed. "That is our priority, making it fast." Currently at version 6.1, Gauss costs around $2,495 for a single-user, commercial license; it is available for Windows, MacOS, Linux, and other Unix variants. As with Matlab, Gauss function source code is available. And developing stand-alone applications is possible using either a freely distributable runtime module or a Gauss engine, which allows developers to incorporate Gauss functions into their programs (from $3,995).

Other software solutions exist, of course. Programs like Maple (a symbolic math package that received 3% of votes in the bioinformatics.org poll) and MathCAD, for instance, can perform some of the functions of the applications discussed above. And it is possible to write math software in "lower-level" languages like C or C++ (see, for instance, Numerical Recipes in C++, Cambridge University Press, 2002).

Ultimately, the choice of programming language often has as much to do with workplace culture, cost, and convenience as with anything else, so unless your application requires a specific function, you might be fine with any available package. Nevertheless, Stefan Steinhaus, a business intelligence consultant for Siemens Business Services in Munich, Germany, has for several years been systematically comparing mathematical packages. His most recent report, from September 2004 (version 4.42, available online at http://www.scientificweb.de/ncrunch/ncrunch4.pdf), picks Matlab 6.5 as the overall best package, just narrowly ahead of Gauss 5.0.