Microarray technology allows the simultaneous monitoring of expression levels of thousands of genes and even whole genomes. But the experiments themselves represent only the opening act. The grand finale, so to speak, is the data analysis. Sophisticated statistical software and tools exist, but the sheer volume of data, the likes of which are unfamiliar to most biologists (and most statisticians), still represents the primary hurdle in DNA microarray research.
"Five years ago, there were no statisticians in the field, and molecular biologists were not talking to statisticians," says Thomas J. Downey, president and CEO of Partek, a St. Charles, Mo.-based provider of statistical and visual pattern-recognition software for scientists. "I was saying, 'Hey, we need to be using statistics on this data,' and they would throw me out of the room." But things are different now, he says: "More statisticians have entered the field, although I think there is plenty...
NEW AND IMPROVED
VizX Labs in Seattle recently upgraded its easy-to-use GeneSifter.net, a Web-based analysis package. The product includes new features specific to Affymetrix platform users: a new pathways report, CLARA clustering analytics, a box-plot option for use in quality control, enhanced dendrograms, additional sorting and searching options, and an improved ability to construct gene lists.
Company vice president Elon Gasper says, "GeneSifter.net handles lists in a way that doesn't require researchers to go through on a gene-by-gene basis to find the pattern of expression their arrays are showing. In addition, the gene ontology operations allow the user to look at the biologic sequence at the same time that they're looking at statistical signals."
Ingenuity Systems of Mountain View, Calif., has recently upgraded its Pathways Analysis software. The program is a Web-delivered application that enables biologists to explore therapeutically relevant networks significant to their gene- or protein-expression array datasets. Using the product, a researcher can cross-reference biological networks generated from their datasets with well-known pathways. The software also allows researchers to expand from a single network to visualize other biologically related networks, and it contains new gene and protein dataset mapping, including support for two new Affymetrix arrays.
InforSense, based in London, recently introduced a new application module for its InforSense 1.9: BioScience package that eases analysis of genomics and proteomics data. According to the company, the upgrade directly integrates sequences, expression data, and ontology-based knowledge for a wide range of analytical applications in genomics and proteomics studies. Example applications include integrative gene-expression analysis and high-throughput gene and protein annotation.
Likewise, Iobion Informatics, based in La Jolla, Calif., recently released PathwayAssist 2.5, desktop software for biological pathway visualization and analysis. New features include molecular-network databases for four model organisms (
For thrifty scientists, or even those skilled in programming, the open-source software movement offers analysis tools at no cost. Bioconductor
Aedene Culhane, a researcher with University College Dublin and a collaborator with the European Bioinformatics Institute, says Bioconductor has been extremely useful because it provides a resource where researchers can put their scripts and methods. "It's a way of galvanizing methods so people can have access to other people's source code and they can see whether the method is a valid method or not," she says. And that reduces the need for bioinformaticians to reinvent the wheel, she adds. "There used to be a tendency of people reinventing the same method, but just calling it something different."
Another open-source microarray analysis package is TM4, developed by John Quackenbush's group at The Institute for Genomic Research in Rockville, Md. TM4 consists of four major applications (Microarray Data Manager, TIGR Spotfinder, Microarray Data Analysis System, and Multiexperiment Viewer) as well as a MySQL database that complies with MIAME (Minimum Information About a Microarray Experiment).
MIAME, a standard developed by the Microarray Gene Expression Data Society (MGED), lays out the absolute minimum amount of information researchers should provide to allow experiment reproducibility. Several journals now require researchers to meet MIAME requirements when submitting research for publication. "Increasing efforts have been made to standardize the way in which data is reported and shared," Culhane says.
"MGED is trying to use standard object models for microarray data as well as data for proteomics, mass spectrometry data, and 2-D gels," says Culhane. "We hope that there will be standard ontologies for all of these so that researchers can do better database queries and more effective data analysis," she adds.
ERRORS IN MATH
The major trend in the last couple of years has been to analyze microarray data relative to genetic processes and pathways, Olshen says. "Someone may identify 500 genes that seem interesting in the experiment, but then it is extremely time consuming for the biologist to go through those lists to try to figure out what those genes mean and their relationship to biological processes. Now, with the help of many types of software, people are much better able to process data on the genes of interest and interpret the meaning of their experiment," he says.
Still, major errors are often made in the statistics of microarray papers, says Olshen. "The one mistake that's made in about half of all microarray papers is when people cluster data. If you have two cancer samples, and you find that all the data from cancer one clusters together and all the data from cancer two clusters together, this is interesting because that means that the gene-expression profiles of the two samples may be different."
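But then the researchers make a mistake, he says. Rather than clustering on all 40,000 genes, they first choose a subset of genes that can distinguish the two cancer types and then cluster on that subset. Because those genes were picked precisely for separating the groups, the clusters will split along the same lines whether or not a real difference exists. It is a subtle error, Olshen says, but it is one that makes expression patterns appear distinct when they are not.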
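The selection effect described above can be reproduced with pure noise. The following is a minimal Python sketch, not Olshen's own analysis: it assumes 20 simulated samples with arbitrary "cancer one" and "cancer two" labels, 5,000 random genes with no real group difference, and a crude nearest-neighbor check standing in for a full clustering method. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def nn_label_agreement(samples, labels, genes):
    """Fraction of samples whose nearest neighbor (squared Euclidean
    distance over `genes`) carries the same label. Used here as a simple
    stand-in for "the samples cluster by label"."""
    hits = 0
    for i, a in enumerate(samples):
        best_j, best_d = None, float("inf")
        for j, b in enumerate(samples):
            if i == j:
                continue
            d = sum((a[g] - b[g]) ** 2 for g in genes)
            if d < best_d:
                best_j, best_d = j, d
        hits += labels[best_j] == labels[i]
    return hits / len(samples)


def simulate(n_per_group=10, n_genes=5000, n_keep=50, seed=0):
    """Generate noise-only expression data, then compare label agreement
    on all genes with agreement on genes picked for separating the labels."""
    rng = random.Random(seed)
    labels = [0] * n_per_group + [1] * n_per_group
    # Every value is independent noise: the two "types" are not different.
    samples = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in labels]

    def mean_gap(g):
        m0 = sum(s[g] for s, lab in zip(samples, labels) if lab == 0) / n_per_group
        m1 = sum(s[g] for s, lab in zip(samples, labels) if lab == 1) / n_per_group
        return abs(m0 - m1)

    # The flawed step: keep only the genes that best distinguish the labels.
    selected = sorted(range(n_genes), key=mean_gap, reverse=True)[:n_keep]

    all_genes = nn_label_agreement(samples, labels, range(n_genes))
    chosen = nn_label_agreement(samples, labels, selected)
    return all_genes, chosen


if __name__ == "__main__":
    all_genes, chosen = simulate()
    print(f"agreement using all genes:      {all_genes:.2f}")
    print(f"agreement using selected genes: {chosen:.2f}")
```

On all genes, agreement should hover near chance, while on the preselected genes the two arbitrary groups look cleanly separated, even though the data contain no signal at all.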
"Another common error is that people will develop prediction models to determine whether they have cancer type one or cancer type two from their gene-expression data," Olshen says. "But they'll only have, say, five or 10 samples with which they are working." Correctly predicting four samples out of five gives an accuracy rate of 80%. "But with only five samples, there's a huge variability in that number. The point is that you look at sample sizes that are so small that the variability of your prediction makes it meaningless," he concludes.
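To put a number on that variability, one option is a confidence interval for the accuracy rate. The sketch below uses the Wilson score interval, which is one common choice and an assumption here, not a method named by Olshen; the function name and the 500-sample comparison are likewise illustrative.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion
    (z=1.96 corresponds to 95% coverage)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for correct, total in [(4, 5), (400, 500)]:
        lo, hi = wilson_interval(correct, total)
        print(f"{correct}/{total} correct: 95% interval {lo:.2f} to {hi:.2f}")
```

For four correct out of five, the interval runs from roughly 38% to 96%, so the true accuracy could plausibly be no better than a coin flip between two cancer types. The same 80% observed on 500 samples would give a far narrower range.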
So researchers need more than just solid statistical packages when working with array data, says Wolfgang Huber, with the molecular genome analysis division at the German Cancer Research Center in Heidelberg. "You need a good experimental design and directed biological questions to guide the analysis of microarray data."
"Often, there are a lot of transcriptional changes associated with a certain biological condition or disease, but only a few of them play a causal role," Huber adds. "The challenge is to isolate these from the mass of bystanders. You need to pull together all available metadata or additional datasets to achieve this."