Making Sense of Microchip Array Data


Apr 30, 2001
Michael Brush

Courtesy of Spotfire

Spotfire's Array Explorer

Gene expression profiling using DNA microarrays generates reams of data. But as is so often the case, it's not the quantity but the quality that matters: Gene expression data is useless unless biologically meaningful information can be extracted and presented in some readily understandable fashion. The production of this meaningful information, involving many facets of image processing, statistical analyses, and data visualization, is only possible with computers running sophisticated software.

The overall process begins with cDNA or oligonucleotides spotted in two-dimensional arrays onto glass slides or nylon membranes, or synthesized on biochips using technology borrowed from the semiconductor industry.1 Messenger RNAs isolated from the test and reference tissues are labeled by reverse transcription with either a red or green fluorescent dye, mixed, and hybridized to the microarray. After washing, the bound fluorescent dyes on the arrays are interrogated by a laser, producing two images, one of each color. The resulting ratio of the red and green spots on the two images provides information about the changes in expression levels of the genes across experimental conditions.

Software developers and companies have created numerous packages designed to analyze the images and hybridization intensity data obtained from the arrays. These packages handle, with varying levels of complexity, the general problems associated with making sense of the gene expression data derived from DNA microarray technology.

Software Array

Microarray analysis software packages fall into three general categories. The first group consists of stand-alone packages designed for the general user. These products, such as the ArrayGauge from Fujifilm Medical Systems Inc. of Stamford, Conn., and the ImageMaster Array 2 from Amersham Pharmacia Biotech of Piscataway, N.J., accept images from most microarray scanners, typically in the form of 16-bit TIFF files. They offer plenty of flexibility for analyzing data generated by differing instrumentation and array types.

Courtesy of Axon Instruments

Axon Instruments' GenePix Pro

A second group consists of software packages configured to operate within specific array scanners. For example, the arrayWoRx™ Microarray Scanner from Applied Precision Inc. of Issaquah, Wash., features an optional analysis software package that provides concurrent imaging and analysis of microarrays. This allows users to immediately link the output images to tabular results and graphical ratio analysis results as each slide is scanned. Other packages, such as the GenePix Pro 3.0 Array Analysis Software from Axon Instruments of Foster City, Calif., and the QuantArray software from Packard BioChip Technologies of Billerica, Mass., (formerly GSI Lumonics) are integral parts of the GenePix 4000B scanner and ScanArray microanalysis systems, respectively. These systems are also available as stand-alone packages for increased utility.

A third group consists of software crafted to analyze array-specific systems. The Microarray Suite Software from industry pioneer Affymetrix Inc. of Santa Clara, Calif., services the company's GeneChip microarrays. Similarly, ArrayTools™ software from Incyte Genomics of Palo Alto, Calif., investigates the large amount of data generated by Incyte's LifeArray™ microarrays, and the Pathways 3 Microarray Analysis Software from Research Genetics of Huntsville, Ala., handles the needs of the company's GeneFilters and can also process a variety of arrays in different formats. VistaLogic from Genometrix Inc. of The Woodlands, Texas, provides gene expression and gene analysis of the data generated by Genometrix's VistaArray℠ microarrays. The high-throughput VistaArray system is designed to process 96 arrays in parallel.

Courtesy of MolecularWare

MolecularWare's DigitalGENOME software suite

Software suites are commonplace. Consisting of various software modules with specific functions and integrated into one unit, suites exhibit enhanced functionality when related modules work in concert. The DigitalGENOME™ software suite from Cambridge, Mass.-based MolecularWare Inc., combines three software solutions into an integrated workspace capable of handling the data needs of microarray workflow. In this case, sample tracking information obtained from the production of microplates and arrays is used as a framework for capturing and processing contextual information about each spot or well, simplifying the annotation of millions of microarray spots further along in the process. The suite also provides an open framework for extensive descriptive annotation by the user, according to MolecularWare president Seth Taylor. "Researchers can add thoughts, ideas, links to websites, and other relevant information to create a personal database as unique as their individual lab notebooks," he says. DigitalGENOME also provides enhanced communications options by readily integrating with leading data mining and analysis packages.

Image is Everything

Regardless of the category into which they fall, the essential task of array analysis software packages is image analysis. Indeed, the primary goal is to measure the intensity of the arrayed spots and then convert those intensity values into quantified expression data. While this may appear straightforward, numerous problems can lead to questionable results. The initial issue is the assignment of a grid or template to define the spot locations from which the data will be captured. If not performed carefully (some arrays exhibit positional distortions), hybridization values can be assigned to the wrong spot or will be erroneous because only part of a spot was examined. Spot irregularities stemming from the array fabrication process and bright or dark regions within specific spots caused by detritus also complicate the image analysis process.

Most software packages that carry out image analysis include an automatic alignment process that requires little, if any, user intervention. These alignment programs automatically apply a grid to the array and then extract hybridization values. Some packages, like GenePix Pro from Axon Instruments, graphically flag spots as good, bad, or absent and then export these flags with all other numerical data. Still other algorithms successfully accommodate irregularly shaped spots as well as assign spot boundaries.

Courtesy of Scanalytics

The MicroArray Suite from Scanalytics

The Align Arrays module of the IPLab MicroArray Suite for Macintosh from Fairfax, Va.-based Scanalytics Inc. illustrates another alignment strategy. Align Arrays superimposes the control and experimental microarrays and corrects misalignments by shifting one array relative to the other. This feature ensures the highest possible accuracy when calculating the ratio of intensities between the two channels.

Background determination can also present problems, particularly when signal intensities are low, but software designers have developed various methods to determine background values. One of the most common methods involves the measurement of the signals around each spot. This value is then subtracted before any ratio calculations are performed. Other approaches include subtracting a standard background value or determining background from specific areas on the array, a technique offered by ArrayVision™ from Imaging Research of Ontario, Canada.
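The local-background method described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm of any package named here: the function name, the use of the median of the surrounding pixels as the background estimate, and the `floor` value that keeps corrected signals positive are all assumptions.

```python
from statistics import median


def background_corrected_ratio(red_spot, red_bg, green_spot, green_bg, floor=1.0):
    """Return the red/green ratio after local background subtraction.

    red_bg and green_bg are pixel intensities sampled around the spot.
    Their median is taken as the local background (an assumption; real
    packages may use the mean or another estimator).
    """
    # Subtract the background from each channel before forming the ratio.
    # The floor keeps weak spots from yielding zero or negative signals.
    red = max(red_spot - median(red_bg), floor)
    green = max(green_spot - median(green_bg), floor)
    return red / green


# Hypothetical intensities for a single spot.
ratio = background_corrected_ratio(
    red_spot=1300.0, red_bg=[90, 100, 110, 100],
    green_spot=500.0, green_bg=[95, 100, 105, 100],
)
print(ratio)  # 3.0
```

The floor matters most for the low-intensity spots the paragraph mentions, where the background can be as large as the signal itself.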

Courtesy of Media Cybernetics

Media Cybernetics' Array-Pro Analyzer

Because clusters of genes indicative of phenotypes or disease states tend to be expressed at low levels, Silver Spring, Md.-based Media Cybernetics has equipped its Array-Pro Analyzer software to accurately calibrate signal and background intensities based on standards, controls, and replicates. The software can therefore reliably extract small signal changes in the presence of background noise.

Spot Check

After an array of hybridization values has been gleaned from the image analysis software, hybridization ratios are calculated. Traditionally, a two-fold change in the ratio of a spot is accepted as the indication of up- or down-regulation of a gene. Tabular representations of the results (often in tab-delimited forms for export into spreadsheet and word processing software), ratio histograms, and scatter plots are common visualizations of the data.
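As a rough sketch of the two-fold rule, the snippet below classifies a spot from its hybridization ratio. It assumes the ratio is experimental over reference signal and works on a log2 scale, so that a two-fold increase and a two-fold decrease sit symmetrically at +1 and -1. The function name and labels are hypothetical.

```python
import math


def call_regulation(ratio, fold=2.0):
    """Classify a spot as up-regulated, down-regulated, or unchanged.

    ratio is assumed to be experimental / reference intensity, already
    background corrected and greater than zero.
    """
    log_ratio = math.log2(ratio)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "up"
    if log_ratio <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical ratios for three spots.
calls = [call_regulation(r) for r in (4.0, 0.5, 1.5)]
print(calls)  # ['up', 'down', 'unchanged']
```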

Courtesy of BioDiscovery

Screenshot of a PCA plot using BioDiscovery's GeneSight

As important as these results are, they hold little meaning if the underlying hybridization values are of poor quality. How can one determine if the expression changes are indeed statistically significant and thus worthy of additional study? Much of the product literature mentions statistical testing or rigorous quality control testing of the images before data analysis. For example, QuantArray from Packard BioChip Technologies examines measurements like diameter, circularity, pixel area, spot uniformity, and deviation from nominal position to determine confidence factors and a resulting pass/fail report for each spot. Similarly, BioDiscovery Inc. of Los Angeles has developed methods to place statistical confidence on the gene expression levels. BioDiscovery's microarray data analysis software system GeneSight™ employs statistical methods such as analysis of variance (ANOVA), and the company's ConfidenceAnalyzer™ lets scientists eliminate false positives among apparently over-expressed genes at a statistical confidence level they select.

Designed to enhance the data mining process, the recently released ArrayStat™ from Imaging Research was developed specifically to address the statistical analysis of hybridization data. According to product manager Nezar Rghei, "ArrayStat uses novel statistical algorithms to generate sensitive estimates of measurement error. With these numbers in hand, the software then applies classical dependent and independent statistical methods to evaluate differences in gene expression." Rigorously validated, these novel algorithms derive error estimates from the whole array and thus provide error estimates with as few as two replicates per array element. ArrayStat then exports statistically analyzed data into any data mining software program.

Data Mining

With the hybridization data analyzed, the next goal is to search the data and arrange genes according to similarities or dissimilarities in their patterns of expression or identify functions for previously uncharacterized genes. This job is not trivial when thousands of genes are involved across several months of data collection.

The discovery of patterns in gene expression relationships falls under the realm of data mining. For microarrays, data mining methods are derived from mathematical techniques developed to identify patterns in applications like phoneme detection in speech processing and bandwidth compression in telecommunications. Known collectively as clustering, these multivariate statistical methods have become the essential tools for the elucidation of gene expression patterns in microarray data.

Data mining typically uses three types of clustering techniques. Hierarchical clustering is a common approach whereby data sets are split into classes and then subclasses, eventually forming a hierarchy displayed as a dendrogram. This technique has proven valuable in microarray data analysis.2 K-Means, a nonhierarchical clustering method, repeatedly examines the data to create and refine clusters in order to maximize the significance of the intergroup distance. The third method, called Self-Organizing Maps (SOMs),3 is a variation of the K-Means methodology. SOMs are a subfield of neural networks, a system of algorithms developed to explain how parts of the brain might self-organize into precise structures.
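To make the K-Means idea concrete, here is a minimal sketch of the textbook version (Lloyd's algorithm), which alternates between assigning each expression profile to its nearest cluster center and recomputing the centers. It is not the implementation used by any product in this article. The function names, the naive choice of the first k profiles as starting centers, and the sample log-ratio profiles are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, max_iterations=50):
    """Cluster profiles into k groups; returns (labels, centers)."""
    # Naive initialization: use the first k profiles as starting centers.
    centers = [list(p) for p in profiles[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each profile joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clusters have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical log-ratio profiles for four genes across three conditions.
profiles = [
    [2.0, 2.1, 1.9],
    [-2.0, -1.8, -2.2],
    [1.8, 2.0, 2.2],
    [-1.9, -2.1, -2.0],
]
labels, centers = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1]
```

Results depend on the starting centers, which is why practical tools typically rerun the procedure from several initializations.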

Courtesy of BioDiscovery

Screenshot of the main user interface of GeneSight, from BioDiscovery Inc.

Another useful data mining method, principal component analysis (PCA), is a "data reduction" technique used to identify uniquely expressing genes. PCA replaces a large number of variables with a smaller number, with little loss of information. BioDiscovery's GeneSight data mining system uses clustering and PCA as complementary techniques to identify groups of genes for further analysis. "When we do clustering, we have to look into the biological meaning," explains Alexander Kuklin, applications manager for BioDiscovery. Consequently, GeneSight uses the published expression patterns of known genes as examples for researchers to compare their data against as a means of identifying unknown genes. If an unknown gene behaves like a known group of cancer genes, for instance, then important information is gained about its function.
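The data-reduction idea behind PCA can be illustrated with a small sketch that finds only the first principal component, using power iteration on the covariance matrix. This is a simplification chosen to keep the example self-contained; it is not how GeneSight or any other package is known to compute PCA, and the function name and sample data are hypothetical. Rows are genes and columns are experimental conditions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column (condition) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between conditions.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. Assumes the data vary and that the start vector is not
    # orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each gene's three measurements collapse to one score along v.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical data: four genes measured under three conditions.
rows = [[1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [3.0, 6.0, 1.0], [4.0, 8.0, 1.0]]
direction, scores = first_principal_component(rows)
```

In this toy data set all of the variation lies along one direction, so a single score per gene captures it completely. Real expression data would need several components.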

Presenting all of this gene expression data in visual form, however, creates challenges to which software designers are finding colorful solutions. "We like to guide users away from a table of numbers and into a graphical format where they can easily visualize trends in the data and see what's really important," says Genometrix software project manager Karon Dacus. The company's VistaLogic software incorporates a variety of methods for visualizing gene expression profiles. In addition to the typical dendrograms and red and green color block charts used by many data mining packages to depict cluster analysis results, VistaLogic uses a unique 3-D plot with peaks and valleys to depict up and down regulated genes. This plot rotates and has a zoom feature designed to extract additional information from the data. VistaLogic also includes a 3-D scatter plot presentation of principal component analysis results, and radial, line, and bar charts for individual groups of profiles. The graphics are very flexible and user friendly, easily modified, and readily exported into other Windows-based applications for report generation. With this same intuitive graphical environment, VistaLogic users can examine genotype and allele frequency distributions, and explore genotype-phenotype associations.

Courtesy of Silicon Genetics

GeneSpring Expression Analysis software from Silicon Genetics

GeneSpring™ Expression Analysis software from Silicon Genetics of Redwood City, Calif., offers a host of visualization and analysis tools. Included among the visualization options are graphics showing the physical position of a gene on a chromosome or plasmid (when the information is available) and expression profile graphs. Users can click on an individual profile to learn more about the underlying gene and can find genes with a similar expression profile based on a variety of similarity measures (e.g., Pearson correlation or Euclidean distance).
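The two similarity measures mentioned above can behave quite differently, as the following sketch suggests. The functions are standard textbook definitions, not GeneSpring code, and the two profiles are made-up values.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation; assumes neither profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Two hypothetical profiles with the same shape but different magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

correlation = pearson_correlation(gene_a, gene_b)
distance = euclidean_distance(gene_a, gene_b)
```

Here the correlation is essentially 1 because the profiles rise and fall together, while the Euclidean distance is large because the magnitudes differ. Which measure is appropriate depends on whether the shape or the absolute level of expression matters for the question at hand.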

BioDiscovery's GeneSight offers some 19 color maps for data visualization in its clustering modules in addition to GenePie™. GenePie automatically converts data to pie shapes that use color to show the ratio of control to experimental gene expression levels.

Bioinformatics and data storage are the culmination of the microarray analysis process.4 Many of the software analysis packages offer immediate access to many public or institutional genetic databases via the Internet, as well as access to numerous databases for storage and comparison of data across multiple experiments. The Affymetrix Laboratory Information Management System (LIMS) 2.0, for example, provides central data management of the information generated by the Affymetrix GeneChip technology, and MolecularWare's IntegratorDG™ securely stores and publishes data from leading database environments.

The future of microarray technology looks bright, particularly in light of the recently published sequence of the human genome.5,6 The possibilities are exciting, with dramatic new findings within reach, says Alexander Kuklin. "With automation and all of these [microarray] tools, researchers can, with the click of a mouse, go through their data mining and find the gems without too much sweating. It's like sitting in front of your TV and climbing Mount Everest."

Michael Brush is a freelance writer in Tustin, Calif.

1. J. Cortese, "Array of options," The Scientist, 14[11]:26, May 29, 2000.

2. C.M. Perou et al., "Distinctive gene expression patterns in human mammary epithelial cells and breast cancers," Proceedings of the National Academy of Sciences, 96:9212-7, 1999.

3. T. Kohonen, Self-Organizing Maps, Springer Verlag, Berlin-Heidelberg, Germany, 1995.

4. C. Smith, "Computational Gold: Data mining and bioinformatics in the next millennium," The Scientist, 13[9]:21, Apr. 26, 1999.

5. J.C. Venter et al., "The sequence of the human genome," Science, 291:1304-51, Feb. 16, 2001.

6. International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature, 409:860-921, Feb. 15, 2001.