Expanding Options in Data Analysis


Emma Hitt
Jul 4, 2004

Microarray technology allows the simultaneous monitoring of expression levels of thousands of genes and even whole genomes. But the experiments themselves represent only the opening act. The grand finale, so to speak, is the data analysis. Sophisticated statistical software and tools exist, but the sheer volume of data, the likes of which are unfamiliar to most biologists (and most statisticians), still represents the primary hurdle in DNA microarray research.

"Five years ago, there were no statisticians in the field, and molecular biologists were not talking to statisticians," says Thomas J. Downey, president and CEO of Partek, a St. Charles, Mo.-based provider of statistical and visual pattern-recognition software for scientists. "I was saying, 'Hey, we need to be using statistics on this data,' and they would throw me out of the room." But things are different now, he says: "More statisticians have entered the field, although I think there is plenty...


VizX Labs in Seattle recently upgraded its easy-to-use GeneSifter.net, a Web-based analysis package. The upgrade adds features specific to Affymetrix platform users: a new pathways report, CLARA clustering analytics, a box-plot option for quality control, enhanced dendrograms, additional sorting and searching options, and an improved ability to construct gene lists.

Company vice president Elon Gasper says, "GeneSifter.net handles lists in a way that doesn't require researchers to go through on a gene-by-gene basis to find the pattern of expression their arrays are showing. In addition, the gene ontology operations allow the user to look at the biologic sequence at the same time that they're looking at statistical signals."

Ingenuity Systems of Mountain View, Calif., has recently upgraded its Pathways Analysis software. The program is a Web-delivered application that enables biologists to explore therapeutically relevant networks significant to their gene- or protein-expression array datasets. Using the product, a researcher can cross-reference biological networks generated from their datasets with well-known pathways. The software also allows researchers to expand from a single network to visualize other biologically related networks and contains new gene and protein dataset mapping, including two new Affymetrix arrays.

InforSense, based in London, recently introduced a new application module for its InforSense 1.9: BioScience package that eases analysis of genomics and proteomics data. According to the company, the upgrade directly integrates sequences, expression data, and ontology-based knowledge for a wide range of analytical applications in genomics and proteomics studies. Example applications include integrative gene-expression analysis and high-throughput gene and protein annotation.

Likewise, Iobion Informatics, based in La Jolla, Calif., recently released PathwayAssist 2.5, desktop software for biological pathway visualization and analysis. New features include molecular-network databases for four model organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana), the capability to map gene-expression data onto KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, and the ability to save pathways in HTML format as interactive, clickable maps. In addition, the new version features integration with GeneSpring and ArrayAssist gene-expression analysis software. "The database performance has been increased fivefold, and we have added an 'Update Pathway' feature, instantly extracting new information about a specified protein from PubMed," Iobion says.


For thrifty scientists, or even those skilled in programming, the open-source software movement offers analysis tools at no cost. Bioconductor (http://www.bioconductor.org), for example, is a collaborative open-development software project started by researchers at the Dana-Farber Cancer Institute in Boston. "It is a way for people to share their software," says Adam Olshen, assistant attending biostatistician in the department of epidemiology at Memorial Sloan-Kettering Cancer Center, New York. "In my opinion, it is one of the important recent advances in the field." According to the Bioconductor Web site, many of the software tools are general and can be used broadly for the analysis of genomic data, such as SAGE (serial analysis of gene expression), single nucleotide polymorphisms, or sequence data.

Aedene Culhane, a researcher with University College Dublin and a collaborator with the European Bioinformatics Institute, says Bioconductor has been extremely useful because it provides a resource where researchers can put their scripts and methods. "It's a way of galvanizing methods so people can have access to other people's source code and they can see whether the method is a valid method or not," she says. And that reduces the need for bioinformaticians to reinvent the wheel, she adds. "There used to be a tendency of people reinventing the same method, but just calling it something different."

Another open-source microarray analysis package is TM4, developed by John Quackenbush's group at The Institute for Genomic Research in Rockville, Md. TM4 consists of four major applications: Microarray Data Manager, TIGR Spotfinder, Microarray Data Analysis System, and Multiexperiment Viewer, as well as a MIAME (Minimum Information About a Microarray Experiment)-compliant MySQL database.

MIAME, a standard developed by the Microarray Gene Expression Data Society (MGED), lays out the absolute minimum amount of information researchers should provide to allow experiment reproducibility. Several journals now require researchers to meet MIAME requirements when submitting research for publication. "Increasing efforts have been made to standardize the way in which data is reported and shared," Culhane says.

"MGED is trying to use standard object models for microarray data as well as data for proteomics, mass spectrometry data, and 2-D gels," says Culhane. "We hope that there will be standard ontologies for all of these so that researchers can do better database queries and more effective data analysis," she adds.


The major trend in the last couple of years has been to analyze microarray data relative to genetic processes and pathways, Olshen says. "Someone may identify 500 genes that seem interesting in the experiment, but then it is extremely time consuming for the biologist to go through those lists to try to figure out what those genes mean and their relationship to biological processes. Now, with the help of many types of software, people are much better able to process data on the genes of interest and interpret the meaning of their experiment," he says.

Still, major errors are often made in the statistics of microarray papers, says Olshen. "The one mistake that's made in about half of all microarray papers is when people cluster data. If you have two cancer samples, and you find that all the data from cancer one clusters together and all the data from cancer two clusters together, this is interesting, because that means that the gene-expression profiles of the two samples may be different."

But then the researchers make a mistake, he says. Instead of clustering on all 40,000 genes, they first choose a subset of genes that can distinguish the two cancer types. It is a subtle error, Olshen says, but because the genes were selected using the class labels, it makes expression patterns appear distinct even when they are not.
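The error Olshen describes can be illustrated with a toy simulation (the numbers and function names below are illustrative, not from the article): if you generate pure noise, assign samples arbitrarily to two "cancer types," select the genes that best separate those groups, and only then cluster, the clusters will match the groups almost perfectly even though there is no real signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure noise: 10 "samples" x 40,000 "genes", with no real group difference.
n_samples, n_genes = 10, 40_000
data = rng.normal(size=(n_samples, n_genes))
labels = np.array([0] * 5 + [1] * 5)  # arbitrary split into two "cancer types"

# The biased step: pick the 50 genes that best separate the two groups
# BEFORE clustering -- the mistake Olshen describes.
diff = data[labels == 0].mean(axis=0) - data[labels == 1].mean(axis=0)
top = np.argsort(np.abs(diff))[-50:]
subset = data[:, top]

def two_means(x, iters=20):
    """Naive 2-means clustering (Lloyd's algorithm), for illustration only."""
    centers = x[[0, -1]]  # initialize from the first and last samples
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = x[assign == k].mean(axis=0)
    return assign

clusters = two_means(subset)
# Agreement between clusters and the arbitrary labels (up to relabeling):
agree = max((clusters == labels).mean(), (clusters != labels).mean())
print(f"cluster/label agreement on pure noise: {agree:.0%}")
```

Despite the data being random, the agreement comes out near 100%: selecting the most group-separating genes out of 40,000 guarantees apparent structure. Clustering on all genes, or selecting genes without reference to the labels, avoids the bias.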

"Another common error is that people will develop prediction models to determine whether they have cancer type one or cancer type two from their gene-expression data," Olshen says. "But they'll only have, say, five or 10 samples with which they are working." Correctly predicting four samples out of five gives an accuracy rate of 80%. "But with only five samples, there's a huge variability in that number. The point is that you look at sample sizes that are so small that the variability of your prediction as to make it meaningless," he concludes.

So researchers need more than just solid statistical packages when working with array data, says Wolfgang Huber, with the molecular genome analysis division at the German Cancer Research Center in Heidelberg. "You need a good experimental design and directed biological questions to guide the analysis of microarray data."

"Often, there are a lot of transcriptional changes associated with a certain biological condition or disease, but only a few of them play a causal role," Huber adds. "The challenge is to isolate these from the mass of bystanders. You need to pull together all available metadata or additional datasets to achieve this."

Emma Hitt emma@emmasciencewriter.com is a freelance writer in Marietta, Ga.
