The take-home message from the MicroArray Quality Control (MAQC) consortium was that microarrays can generate reproducible gene expression data, both across platforms and across laboratories. But it's not the last word on array performance issues.
The consortium set out neither to establish industry-wide standard operating procedures and algorithms, nor to define quality metrics you can use to measure your own performance. Its data represent a best-case scenario, collected mostly using two RNA samples that differ from each other far more than the samples researchers are likely to encounter in their own labs. "You're defining a ceiling for how well things should perform," says Marc Salit of the National Institute of Standards and Technology, who was involved in the project as a consultant. "The important thing is that people not draw conclusions that these are the results they should expect..."
1. Your lab's final microarray protocol
MAQC notwithstanding, there is no "one" microarray protocol; consortium test sites used the protocols each manufacturer provided for its specific platform - the same protocol you get with every chip you order. These procedures provide a good starting point for your own work, but like all commercial protocols, they aren't written in stone. According to Ed Lobenhofer, senior scientist at Cogenics, an array service provider that participated in the MAQC project, the company routinely uses the MAQC samples to optimize its in-house protocols, for example to improve sensitivity and detection rate. Roderick Jensen, an MAQC participant at the University of Massachusetts, Boston, uses those same samples to test new array scanners and scanner settings.
The key is to then propagate these new procedures to all users and/or sites in your research group. "If all sites follow the same established operating procedure and good scientific lab practice, you will generate comparable, consistent, good reproducible data," says Janet Warrington, vice president for emerging markets and molecular diagnostics, research & development at Affymetrix.
2. A way to measure quality
"From my standpoint as a statistician, when evaluating quality, you need objective criteria, not subjective graphics and thresholds," says Kellie Archer, assistant professor of biostatistics at Virginia Commonwealth University Medical Center. Genome researchers have such criteria in the PHRED score, which is used to measure confidence in individual sequence base calls. But no comparable metric exists to help array researchers assess the quality of a given datapoint or experiment. "There are no quantitative measures right now for how good is good enough. Those things need to be developed," says Salit.
The need for such measures is especially acute for microarrays, which have so many potential sources of error, including variations in spot size, labeling efficiency, and background intensity. RNA obtained from archived tissue samples or via laser capture microdissection, for instance, tends to be more highly degraded than RNA from other sources, says Archer. "There's a real need to assess quality in samples like that." The External RNA Control Consortium, headed by Affymetrix's Warrington and set to begin work this month, should help address this need, says Salit.
3. Solutions for biological variability
When is a replicate not a replicate? When it's alive. Two genetically identical mouse littermates might produce slightly different gene-expression profiles, for instance, because they ate different amounts prior to tissue collection. Similarly, otherwise identical tissue culture dishes might yield different numbers if they were exposed to slightly different environmental conditions in the incubator. This biological variability is distinct from the technical variability that was the MAQC's focus, and according to Todd Golub, director of the cancer program at the Broad Institute of Harvard and Massachusetts Institute of Technology, should be a much bigger concern.
"It's been clear to me for quite some time that while there is some technical variability at the level of the arrays and processing, that the vast majority of variability is biological, and that dominates the technical variability," he says. The resulting fluctuations in signal intensity are particularly troublesome for weakly expressed genes, says John Quackenbush, professor of biostatistics and computational biology at the Dana-Farber Cancer Institute, as the signal is so close to background. The brute-force solution: replicates, replicates, and more biologic replicates.
4. A universal normalization procedure
For any given data point, says Lobenhofer, a number of variables must be considered when transforming (or normalizing) the spot's raw fluorescence intensity into its final value. These include local background, regional background, negative controls, and so on. Currently, there is no one-size-fits-all normalization procedure, though MAQC participants discussed creating one, Lobenhofer says. Instead, MAQC used a different, manufacturer-recommended normalization protocol for each platform.
For now, microarray users must choose their own road. Lisa White, director of the Baylor College of Medicine Microarray Core Facility, says that "there isn't a 'use this one and go' - we use several [algorithms] and make comparisons." This lack of uniformity, says James Fuscoe, director of the Center for Functional Genomics at the US Food and Drug Administration's National Center for Toxicological Research, leads to differing results. "So you're left with, which one is correct?" Jensen suggests using titration mixtures of the MAQC RNAs to compare normalization methods systematically: By mixing the two samples in specific ratios, you can predict exactly how much of a given RNA you should see in the final analysis. "If you normalize everything wrong, your results will be inaccurate," he says.
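A minimal sketch of Jensen's titration check, with hypothetical signal values: mixing samples A and B in a known ratio fixes what a correctly normalized measurement of the mixture should be, so systematic deviations from that prediction point to a normalization problem.

```python
def predicted_mixture_signal(signal_a: float, signal_b: float, frac_a: float) -> float:
    """Expected linear-scale signal for a titration mixture containing
    a fraction frac_a of sample A and (1 - frac_a) of sample B."""
    return frac_a * signal_a + (1 - frac_a) * signal_b

def titration_error(observed_mix: float, signal_a: float, signal_b: float, frac_a: float) -> float:
    """Relative deviation of an observed, normalized mixture signal from
    the value predicted by the pure samples."""
    expected = predicted_mixture_signal(signal_a, signal_b, frac_a)
    return (observed_mix - expected) / expected

# Hypothetical gene: 200 units in pure A, 1,000 units in pure B.
# A mixture that is 75% A should therefore read about 400 units.
print(predicted_mixture_signal(200, 1000, 0.75))  # 400.0
print(titration_error(520, 200, 1000, 0.75))      # 0.3 -> 30% off; suspect the normalization
```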
MAQC BY THE NUMBERS
Organizations: 51
Participating researchers: 137
Array platforms tested: 7 (Affymetrix, Agilent, Applied Biosystems, Eppendorf, GE Healthcare, Illumina, and NCI_Operon)
Nonarray platforms tested: 3 (Applied Biosystems TaqMan, Panomics QuantiGene, and Gene Express StaRT-PCR assays)
Microarrays used in the study: 1,329
Number of genes in common across all platforms: 12,091
Number of array datasets deposited in public databases: 943
Size of compressed MAQC datasets: 4.3 GB
Months from project initiation to publication: 19 (February 2005 to September 2006)
5. Consensus on ranking interesting genes
MAQC's most controversial conclusion involves strategies for ranking differentially expressed genes, the output of most gene-expression microarray experiments. There are two schools of thought for generating these lists: one based on how much the abundance of particular transcripts changes across conditions (i.e., fold change), and one based on stringent statistical measures.
In light of its data, the MAQC consortium suggests: "Fold-change ranking plus a nonstringent P-value cutoff can be used as a baseline practice for generating more reproducible signature gene lists."
"Background has a large effect on fold change, while variance has a large effect on P-value," says Eric Hoffman, Director, Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC. "The MAQC investigators observed quite inconsistent measures of variance across platforms, and this hurt their P-values, but not their fold-change." Had the project team used different algorithms or had higher background, they would have reached a different conclusion, he says. "What I'm afraid of is that people will go back to their old data, collected with older algorithms, and use fold change, and for some of those algorithms that's going to be a disaster." His recommendation: Stick with statistics.
6. How to identify gene-expression biomarkers
One of the goals of MAQC was to remove doubts about array reproducibility so that regulatory agencies and array manufacturers could start work on making array data part of the drug-submission process. To date there are no accepted procedures for extracting signature gene lists (or biomarkers) that correlate with clinical outcome from expression-profiling data. As a result, two labs looking at the same data can sometimes produce completely different signatures.
"I look at most of the studies that have been published [to date] as demonstration studies," says Quackenbush. "They've shown you can take gene expression profiles and develop classification algorithms. The next step is to say, we can use this technology to distinguish the optimal signature." That, he says, is a question both of sufficient replicates and algorithm development. MAQC Phase II, which kicked off September 21, is working to address this issue.