# Biology by the Numbers


###### Jun 20, 2005

##### Jeffrey Perkel (jperkel@the-scientist.com)

When the graduate students and postdocs in Martin Wilson's lab at the University of California, Davis, need to do image processing, they look to an unlikely source. Instead of off-the-shelf image analysis software, Wilson's team, which studies synapses in the retina, uses home-brewed algorithms running within the mathematical package Matlab, from The Mathworks, Inc., in Natick, Mass.

"Whenever we get images, we apply these routines that we've written to the images to massage the data, improve the signal-to-noise ratio, to do the things to get quantitative images that can tell us something interesting," says Wilson, a professor of neurobiology, physiology, and behavior.

They can do that, because Matlab isn't so much a program as it is a programming language, like Perl or C. The advantage of using such languages, says Kristen Amuzzini, biotech and pharmaceutical marketing manager at The Mathworks, is that programmers need not "reinvent the wheel" to implement algorithms that have already been developed. Take k-means clustering, for instance, which is an algorithm used to cluster microarray data. "In Matlab that's one line of code," says Amuzzini. "In other languages it might be 30 or 50 lines of code."
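To give a sense of what Amuzzini's "30 or 50 lines" might look like, here is a minimal sketch of k-means clustering (Lloyd's algorithm) written in plain Python rather than Matlab. The function name `kmeans`, its parameters, and the toy expression profiles are illustrative assumptions, not code from The Mathworks or from any lab mentioned in this article.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length tuples of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical two-condition expression profiles for six genes.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
```

Even this stripped-down version has to handle initialization, convergence, and empty clusters by hand, which is the bookkeeping a built-in routine hides.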

Matlab is just one of a number of math-centric languages, both open source and commercial, that researchers can use to write customized, mathematically intensive applications for their labs, everything from filtering noise from experimental time series data to running systems biology models. Though intended for a broader audience than life scientists, these packages nevertheless are powerful applications and languages with a significant presence in the academic, biotech, and pharmaceutical communities.

## TO THE LETTER: R AND S

If

R is an implementation of S, a statistical computing language developed at AT&T Bell Labs. "It's basically a programming language with a fairly rich set of tools for doing statistical and other models, fairly high quality graphics," says Gentleman. The language and environment run in Windows, MacOS, and Linux; version 2.1.0 was released this past April.

The language's utility in bioinformatics is clear: Three years ago Gentleman, who calls himself a computational biologist rather than a bioinformatician ("I don't know what such a beast is," he quips), used it to organize a collaborative project called Bioconductor

Reaction to the project has been positive, says Gentleman, "certainly more positive than I imagined." Gentleman doesn't track downloads, but Bioconductor's mailing list does sport some 1,200 addresses.

Insightful Corp. of Seattle bought the rights to S software from Lucent Technologies in 2004 and now offers a commercial version of the language called S-Plus. Though S-Plus is largely compatible with R, differences do exist. But the original developers of S and R, among other languages, are collaborating on new work that will help researchers get the most out of existing statistical software, says John Chambers, who led the original design of S at Bell Labs Research in Murray Hill, NJ, and is now a member emeritus there. The Omegahat Project

## MATLAB

Matlab garnered second place in the bioinformatics.org survey with 20%. Currently at version 7.0.4 and available for Windows, Mac OS, Linux, and Unix, Matlab costs $1,900 for a single-user commercial license.

Matlab is both a development environment and a language. For simple mathematical tasks, a user can execute instructions in the program's command window, much as one would execute a dir (directory) command at a DOS prompt. For more complicated work, Matlab has a programming editor, in which full-fledged applications may be built and tested. Those programs may then be run from within the environment, or compiled as stand-alone executable files or Microsoft Excel plug-ins (though this capability requires additional software from The Mathworks).

**Selected Suppliers**

Aptech Systems (Gauss)

Insightful Corp. (S-Plus)

Maplesoft (Maple)

Mathsoft (MathCAD)

The Mathworks Inc. (Matlab)

Wolfram Research (Mathematica)

Complementing Matlab's intrinsic feature set is a collection of toolboxes, bundled groups of 50 to 100 functions that enhance Matlab's existing functionality in a particular area. Popular in the biological space, says Amuzzini, are toolboxes dedicated to statistics, distributed computing, and bioinformatics. The Bioinformatics Toolbox, currently at version 2.1, costs $1,000 for a single-user commercial license.

Most of the source code for the toolboxes, like Matlab's internal functions, is available to users, allowing them both to see how algorithms are implemented and to customize them. Wilson's UC Davis team uses and extends the Image Processing Toolbox, an add-on package that allows them to, among other things, load image files, detect edges, and count pixels.
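As a rough illustration of two of those operations, the sketch below counts bright pixels and marks simple edges in a tiny grayscale image stored as a list of rows. It is written in plain Python, not Matlab, and is not the Image Processing Toolbox or the Wilson lab's code; the function names, the threshold, and the test image are all hypothetical.

```python
def count_bright_pixels(image, threshold):
    """Count pixels whose intensity exceeds the threshold."""
    return sum(1 for row in image for value in row if value > threshold)


def edge_map(image, threshold):
    """Mark pixels where the intensity jump to the right-hand or lower
    neighbor exceeds the threshold (a crude forward-difference detector)."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Differences are treated as zero at the image border.
            dx = image[y][x + 1] - image[y][x] if x + 1 < width else 0
            dy = image[y + 1][x] - image[y][x] if y + 1 < height else 0
            if abs(dx) > threshold or abs(dy) > threshold:
                edges[y][x] = 1
    return edges


# Hypothetical 4x4 image: a bright 2x2 spot on a dark background.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
bright = count_bright_pixels(image, threshold=5)
edges = edge_map(image, threshold=5)
```

A production toolbox would typically use smoothed gradient operators and handle noise far more carefully; the point here is only that each operation reduces to arithmetic on a matrix of intensities.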

Unsatisfied with Matlab's Bioinformatics Toolbox, James Cai, a PhD student at the University of Hong Kong, developed a custom package called MBEToolbox.^{1} "I regard MBEToolbox as an algorithm prototyping environment for molecular evolutionists," Cai writes in an E-mail. "But looking back, I started MBEToolbox programming as a self-learning process." Other newly published Matlab toolboxes help users process microarray data and combine functional imaging data with structural maps of the brain, among other things.^{2,3}

For those put off by Matlab's cost and willing to tough it out with a less polished application, several open source Matlab clones exist. These include Octave (

## MATHEMATICA

Five percent of those who responded to the bioinformatics.org poll develop mathematical applications in Mathematica, from Wolfram Research of Champaign, Ill. (Wolfram Research recently entered into a partnership with BioMed Central, a sister company of

Mathematica's strength is in symbolic math – that is, it is capable of solving algebraic equations to give another equation. Languages like Matlab and Gauss, in contrast, excel in numeric math, solving equations with data to provide an answer.
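A textbook example, chosen here purely for illustration, makes the distinction concrete. Asked to solve a general quadratic, a symbolic system returns another equation:

$$
ax^2 + bx + c = 0 \quad\Longrightarrow\quad x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$

A numeric system instead starts from specific values, say $a = 1$, $b = -3$, $c = 2$, and returns numbers:

$$
x^2 - 3x + 2 = 0 \quad\Longrightarrow\quad x = 1 \ \text{or} \ x = 2
$$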

This prowess with symbolic math has made Mathematica (current version 5.1.1) a potent tool for systems biologists. Eric Mjolsness, associate professor of computer science at UC-Irvine, in collaboration with Bruce Shapiro of NASA's Jet Propulsion Laboratory, used the package to develop Cellerator

Christian Jacob, associate professor of computer science and biochemistry and molecular biology at the University of Calgary, has used Mathematica to simulate macromolecular and cellular interactions during immune reactions and at the lactose operon – though he now writes his own code in C++, because Mathematica had trouble handling the size of his models.

Unlike Matlab programs, Mathematica notebooks cannot be compiled into stand-alone applications, so anyone wishing to run an algorithm written in Mathematica must own the program. The professional edition costs $1,880 for a single-user license in Windows, MacOS, or Linux; other Unix implementations cost $3,135.

## GAUSS

Gauss, from Aptech Systems in Seattle, is another mathematical language. "In a nutshell," says company president Sam Jones, "Gauss is a workhorse, number-crunching language with an easy-to-use syntax. You're not required to write thousands of lines of code to solve one problem."

For instance, Gauss (like Matlab and S-Plus) defines a matrix data type. That is, it allows programmers to treat matrices the same as they treat simple data types like integers. So, it can multiply two matrices, a and b (for example, matrices of nucleotide substitution probabilities), with the simple expression a*b. To perform such an operation in C or FORTRAN, Jones explains, would require the programmer to write three nested loops of code, keep track of the matrix size, and watch the program's memory allocation. "So having a matrix data type is pretty fundamental," he concludes.
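The contrast Jones describes can be sketched in a few lines of Python (used here only for illustration; this is not Gauss syntax). The first function spells out the three nested loops; the small `Matrix` class, a hypothetical stand-in for a built-in matrix type, hides them behind `a * b`. The substitution probabilities are made-up values.

```python
def matmul_loops(a, b):
    """Multiply matrices stored as lists of rows using explicit loops."""
    n, m, p = len(a), len(b), len(b[0])
    # The programmer must check the dimensions by hand.
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            for k in range(m):  # shared inner dimension
                result[i][j] += a[i][k] * b[k][j]
    return result


class Matrix:
    """Minimal matrix type so that callers can simply write a * b."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __mul__(self, other):
        return Matrix(matmul_loops(self.rows, other.rows))


# Hypothetical one-step nucleotide substitution probabilities (A, C, G, T):
# each base stays the same with probability 0.97 and changes to each of the
# other three with probability 0.01.
step = Matrix([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])
two_steps = step * step  # probabilities after two successive steps
```

With the matrix type in place, the calling code reads like the mathematics, and the dimension checks and loop indices live in one tested location.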

What distinguishes Gauss from its competitors, according to Jones, is speed. "That is our priority, making it fast." Currently at version 6.1, Gauss costs around $2,495 for a single-user commercial license; it is available for Windows, MacOS, Linux, and other Unix variants. As with Matlab, Gauss function source code is available. Developers can also build stand-alone applications using either a freely distributable runtime module or a Gauss engine, which allows them to incorporate Gauss functions into their own programs (from $3,995).

Other software solutions exist, of course. Programs like Maple (a symbolic math package that received 3% of votes in the bioinformatics.org poll) and MathCAD, for instance, can perform some of the functions of the applications discussed above. And it is possible to write math software in "lower-level" languages like C or C++ (see, for instance,

Ultimately, the choice of programming language often has as much to do with workplace culture, cost, and convenience as with anything else, so unless your application requires a specific function, you might be fine with any available package. Nevertheless, Stefan Steinhaus, a business intelligence consultant for Siemens Business Services in Munich, Germany, has for several years been systematically comparing mathematical packages. His most recent report, from September 2004 (version 4.42, available online at