Like it or not, biologists have to deal increasingly in informatics as their experiments generate ever larger and more complex data sets. Few laboratories have the resources to develop their own microarray analysis software, so they must use one of the many commercial packages. With the list of options growing fast, Affymetrix of Santa Clara, Calif., one of the world's leading makers of DNA microarrays, hit on a novel way to help its customers sample the field: It held a series of Web-based seminars, or "webinars."
The company invited leading vendors of microarray analysis software to run their products on a common data set and present their results over the Web, taking written questions afterwards. The event was a win-win-win situation: Software developers got the opportunity to showcase their products; customers could compare and contrast software they might otherwise never get to see; and Affymetrix was able to demonstrate (without...
The data set was derived from a model mouse cell line, called MPRO, which was developed by the Collins Lab at the Fred Hutchinson Cancer Research Center in Seattle. The aim was to study the mechanisms by which signaling by retinoic acid (a metabolite of vitamin A) causes promyelocytes to stop dividing and turn into mature neutrophils, which provide the first line of defense against bacterial infection in mammals.
The main hypothesis: Retinoic acid signaling sets off a cascade of gene-expression changes that underlie a profound change in the molecular phenotype of the cell, transforming it from a dividing promyelocyte into a terminally differentiated neutrophil, whose eventual apoptosis ends the story. The chosen cell line carries a dominant negative form of the retinoic acid receptor, which blocks the cell's normal granulocytic differentiation pathway.
The experiment kicks off with the addition of exogenous retinoic acid, which starts the differentiation and associated cascade of events. Data was collected at both hourly and daily time points, with four biological samples taken at each point to reduce noise and random effects. The hourly data set (0, 1, 2, 4, and 8 hours) tracked the smaller-scale but critical early changes in gene expression as cell differentiation starts. The daily course (0, 1, 2, 3, 4, 5, and 6 days) homed in on the stronger signals leading towards terminal differentiation and eventual cell death.
Of the 12 vendors Affymetrix invited, eight agreed to take part in the February event. Each was presented with the same data set (see Box), which was designed to assess the flexibility and statistical rigor of the software products. A key aspect of the data set was its division into twin time courses, one hourly and one daily, to assess the software's ability to identify expression changes at different levels and time scales.
The data set itself was widely applauded by the participants. "It was a very good data set, and with the time points we were able to show some of the things that you can't do just with straight visualization of the data," says Michael O'Connell, director of biopharm solutions at Insightful, one of the eight participants.
Some of the participating vendors complained, however, that the event failed to exhibit the broader capabilities of the various products, such as their ability to provide a platform for integrating data from a wide variety of sources, classify groups of genes, or prepare figures for publication either online or in print. "The event was just one example of how the software could be used," says Steve Misener, who presented the webinar for Inforsense, another of the participating vendors.
Jokerst concedes that the event did not necessarily show all vendors in the same light, and that some of those declining the invitation to participate may have done so for fear of failing to match up to some of their rivals on the chosen data set. But Jokerst also points out that the eight participating vendors covered a wide spectrum of requirements, from the single PhD researcher to large pharmaceutical companies working on drug discovery.
One of the presenters, VizX Labs, caters to the less-sophisticated consumer, aiming to move the power of microarray technology into smaller laboratories, according to CEO and founder Tom Ranken. "There's a lot of fine products out there, but you need to be an informatics expert to use them, and there's not until now been anything for the individual scientist working on cancer, for example," says Ranken. VizX Labs' GeneSifter is entirely Web-based, avoiding the need for high-power hardware in-house, and it is strong on visualization. But the software lacks the full range of analysis capabilities needed for some high-throughput projects that attempt to measure expression levels across large numbers of genes, or even the whole genome.
In contrast, Inforsense admits that its GeneSense is expensive, sophisticated, and unsuitable for smaller laboratories. Misener, however, insists that the software works perfectly well for small experiments, and he says the investment would be well justified for a laboratory planning to scale up to larger multiple-gene analyses.
DIFFERENT APPS, SAME CONCLUSIONS
Despite the differences in price and scale, little on the surface differentiated the eight products in the broad conclusions they reached on the Affymetrix data set, except for subtle variations and differences in emphasis. All participants concluded that genes involved in DNA replication and carbohydrate metabolism were upregulated within an hour or two of the start as cell differentiation and division resumed, while genes involved in chemotaxis were initially repressed.
All results also showed that genes involved in immune response and chemotaxis, after initially being downregulated, came up over the daily time scale, as mature neutrophils with the potential to respond to invasion were produced. At the same time, genes involved in protein synthesis and RNA metabolism were shut down over the daily time scale, as the need for cell differentiation and division abated.
Agreement over some of the broader expression pattern movements was not surprising, as all products offer the standard techniques for analyzing gene expression data, such as the fundamental P-value that assesses the statistical confidence one can place in numerical changes. Even on this point, though, some differences emerged, at least in the emphasis drawn by the vendors during their presentations.
Differences were reported, for example, in the numbers of genes identified as being up- or downregulated at particular time points. Such differences can spring from variations in analysis method, but they also follow from the convention of expressing the level measured at each time point as a ratio to the level at time zero. This is done because expression levels tend to change geometrically: In the absence of changing biological conditions, if a gene's expression doubles after an hour, it will double again after another hour. Therefore, to obtain a coherent linear graph useful for correlating genes with similar expression profiles, logarithms of the expression ratios are plotted against time.
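To make the arithmetic concrete, here is a minimal sketch of the log-ratio transformation, assuming a made-up gene sampled at equally spaced time points. The function name `log2_ratios` and the expression values are illustrative only and are not taken from any of the vendors' packages.

```python
import math

def log2_ratios(levels):
    """Return log2(level / level at time zero) for each time point.

    `levels` is a list of expression measurements; index 0 is time zero.
    """
    baseline = levels[0]
    if baseline <= 0:
        # A ratio to a zero or negative baseline is undefined.
        raise ValueError("baseline expression must be positive to form a ratio")
    return [math.log2(x / baseline) for x in levels]

# Hypothetical gene whose expression doubles at each successive time point.
doubling = [100.0, 200.0, 400.0, 800.0]
print(log2_ratios(doubling))  # [0.0, 1.0, 2.0, 3.0]
```

A geometric doubling becomes a straight line with a slope of one log2 unit per time step, which is what makes profiles with similar shapes easy to compare.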
This approach uses the expression levels at time zero as the denominator for subsequent expression ratios, so if these levels are very low, initial sample inaccuracies are amplified. As a result, some of the vendors ignored genes whose expression values were close to zero at the start.
But as Silicon Genetics pointed out during its presentation, a researcher should not assume that a gene is insignificant simply because it was dormant at the start of the experiment. During its analysis, Silicon Genetics first identified those genes that were activated or repressed from a nonzero start point using the standard ratio technique. Then the company turned its attention to genes that started out close to zero but then were subsequently up- or downregulated by comparing differences in expression levels between successive time points, instead of comparing ratios. In this way, Silicon Genetics sidestepped the risk of obtaining too many false-positive or false-negative findings resulting from unstable fractions with denominators close to zero.
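The two-stage idea can be sketched roughly as follows. This is an illustration of the general approach described in the presentation, not Silicon Genetics' actual implementation; the function name `classify_changes`, the thresholds (`floor`, `min_fold`, `min_diff`), and the gene profiles are all hypothetical.

```python
def classify_changes(profiles, floor=20.0, min_fold=2.0, min_diff=100.0):
    """Flag changed genes by ratio or, for near-zero baselines, by difference.

    profiles: dict mapping gene name -> list of expression levels over time,
              with index 0 being time zero.
    Returns (by_ratio, by_difference), two lists of gene names.
    """
    by_ratio, by_difference = [], []
    for gene, levels in profiles.items():
        baseline = levels[0]
        if baseline >= floor:
            # Baseline is well above zero, so the ratio to time zero is stable.
            folds = [x / baseline for x in levels[1:]]
            if any(f >= min_fold or f <= 1.0 / min_fold for f in folds):
                by_ratio.append(gene)
        else:
            # Baseline is near zero: compare successive time points by
            # difference, avoiding a tiny denominator.
            steps = [b - a for a, b in zip(levels, levels[1:])]
            if any(abs(s) >= min_diff for s in steps):
                by_difference.append(gene)
    return by_ratio, by_difference

# Hypothetical profiles (arbitrary units).
profiles = {
    "geneA": [200.0, 210.0, 450.0],  # 2.25-fold rise from a solid baseline
    "geneB": [2.0, 5.0, 300.0],      # dormant at the start, then switched on
    "geneC": [3.0, 9.0, 12.0],       # large "fold" change but tiny absolute change
    "geneD": [500.0, 480.0, 510.0],  # essentially unchanged
}
print(classify_changes(profiles))  # (['geneA'], ['geneB'])
```

Note that a pure ratio test would flag geneC (a fourfold "increase" from 3 to 12) even though its absolute change is within plausible noise, while a filter that simply discards low-baseline genes would miss geneB.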
HOMEMADE IS BETTER?
DNA microarray pioneer Joseph DeRisi of the University of California, San Francisco, in emphasizing the importance of analytical flexibility in such software packages, urges laboratories to write their own software. "If they can't write their own, then probably people should buy a commercial package for their daily needs and write their custom software for certain analyses, and then re-import the data into the package."
DeRisi himself found that no commercial package could cope with the highly periodic malaria transcriptome data produced recently by his own laboratory.1 "We used Fourier transforms to extract the phase information and produce something that looks like cluster grams," says DeRisi, adding that no available gene-clustering technique supported such manipulations. He notes that software capable of easily publishing microarray data on the Web also is unavailable.
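As a rough illustration of what extracting phase information with a Fourier transform can look like, the sketch below computes the phase of a single Fourier component for each gene and orders genes by their estimated time of peak expression. It is not the DeRisi laboratory's code; the function names, the assumption of exactly one cycle across evenly spaced samples, and the synthetic cosine profiles are all hypothetical.

```python
import cmath
import math

def dominant_phase(profile, cycles=1):
    """Phase (radians) of the Fourier component with `cycles` periods over the series."""
    n = len(profile)
    mean = sum(profile) / n
    # Single discrete Fourier coefficient at frequency `cycles`, mean removed.
    coeff = sum((x - mean) * cmath.exp(-2j * math.pi * cycles * t / n)
                for t, x in enumerate(profile))
    return cmath.phase(coeff)

def peak_time(profile, cycles=1):
    """Estimated sample index at which the periodic component peaks."""
    n = len(profile)
    period = n / cycles
    # A cosine peaking at t0 has phase -2*pi*t0/period, so invert that relation.
    return (-dominant_phase(profile, cycles) / (2 * math.pi) * period) % period

def synthetic(n, t0):
    """Hypothetical periodic profile: one cosine cycle over n samples, peaking at t0."""
    return [math.cos(2 * math.pi * (t - t0) / n) for t in range(n)]

genes = {
    "g_late": synthetic(24, 18),
    "g_early": synthetic(24, 3),
    "g_mid": synthetic(24, 9),
}
# Ordering genes by peak time gives the row order for a phase-sorted heat map.
ordered = sorted(genes, key=lambda g: peak_time(genes[g]))
print(ordered)  # ['g_early', 'g_mid', 'g_late']
```

Real time courses would be noisy and unevenly sampled, so this only shows the principle: phase, rather than correlation-based clustering, determines the ordering.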
In practice, most laboratories will need to rely on at least one commercial package for most of their analyses, but it is worth noting that microarray analysis packages are not mutually exclusive and may work well together. Indeed, two of the eight vendors that took part in the Affymetrix series (Insightful and Spotfire) sometimes bundle their products, combining Spotfire's visualization prowess with Insightful's more rigorous statistical analysis, according to O'Connell.
The Affymetrix series suggests that it pays to narrow the choice to software that has already been used successfully for similar workloads. The chosen package should be capable of integrating multiple sources of data and working with both custom and existing analysis methods. Also vital is the ability to integrate with external gene databases for extraction of biologically significant information.
Inforsense's Knowledge Discovery Environment
Silicon Genetics' GeneSpring®
Spotfire® Decision Site for Functional Genomics
Strand Genomics' Avadis
VizX Labs' GeneSifter.Net™