Ge and second author Zhihua Liu were partners in genetics professor George Church's annual course, Genomics and Computational Biology. As part of that course, students must pick an individual project and run with it, says Church. "If the projects are good enough, then I continue to encourage them and help them take it to the next step." Ge and Liu's project was to correlate expression-profiling (transcriptome) data with protein-protein interaction (interactome) data. They found that the two data sets overlap, which adds confidence to follow-up studies built on either type of result.
This project stems from the ever-increasing gap between available sequence data and available functional data. Some researchers now rely heavily on high-throughput methods, such as DNA microarray analysis, to illuminate new and interesting genes. Because such techniques produce vast quantities of data, the results are often clustered: the analysis groups together genes that behave in a similar way. In the case of biochip data, the genes in a cluster tend to change expression levels in the same way in response to a variety of treatments. The theory is that if two genes are in a cluster, and the function of one is known, scientists can infer that the other gene's function is related, perhaps by being in the same pathway.
|©2001 Nature Publishing Group|
Protein-protein interaction data is often similarly used. Scientists reason that if two proteins are members of a macromolecular complex, they likely have similar or related functions. Thus, investigators often wish to use expression-profiling and interaction data as launching points for further studies, to identify the functions of unknown genes.
But how can researchers proceed with confidence that they won't spend years chasing a red herring, or that they haven't missed something really important? The answer, says Vidal, is to overlap multiple types of data, creating a "biological atlas."2 Church explains, "When you find a coincidence between RNAs that co-cluster and proteins that interact, it ... reinforces your confidence that they're both correct."
Ge's approach was to group publicly available expression profiling data into 30 clusters. She then took public yeast two-hybrid interaction data, and investigated whether interacting proteins are encoded by genes in the same cluster (intracluster) or in different clusters (intercluster). The final data take the form of colored squares arrayed in a right triangle with 30 squares on each side. The interaction density for a given square is shown as a color gradient, with yellow squares containing the highest density.
When all the data were input into the matrix, most of the brightest squares were found along the triangle's diagonal, meaning that intracluster interactions are more common than are intercluster ones. When these data were plotted as a histogram, Ge found that the intracluster interaction density is about 6-7 times higher than the intercluster density. In other words, genes that are expressed coordinately often interact at the protein level, and conversely, proteins that interact are often encoded by coordinately expressed genes.
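The intracluster versus intercluster comparison can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy gene labels, and the definition of interaction density used here (observed interacting pairs divided by all possible gene pairs for a given pair of clusters) are assumptions made for illustration, and self-interactions are simply ignored.

```python
from itertools import combinations_with_replacement


def interaction_density(cluster_of, interactions):
    """Return {(i, j): density} for every unordered cluster pair with i <= j.

    cluster_of   -- dict mapping gene name -> cluster id (hypothetical input)
    interactions -- iterable of (gene_a, gene_b) pairs, e.g. two-hybrid hits
    Density is assumed to be observed pairs / possible pairs.
    """
    # Number of genes in each cluster.
    sizes = {}
    for cluster in cluster_of.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1

    # Count each distinct interacting gene pair once, keyed by cluster pair.
    counts = {}
    seen = set()
    for a, b in interactions:
        if a == b or a not in cluster_of or b not in cluster_of:
            continue  # skip self-interactions and unclustered genes
        pair = frozenset((a, b))
        if pair in seen:
            continue  # the same pair reported twice counts once
        seen.add(pair)
        key = tuple(sorted((cluster_of[a], cluster_of[b])))
        counts[key] = counts.get(key, 0) + 1

    # Divide by the number of possible gene pairs for each cluster pair.
    density = {}
    for i, j in combinations_with_replacement(sorted(sizes), 2):
        if i == j:
            possible = sizes[i] * (sizes[i] - 1) // 2
        else:
            possible = sizes[i] * sizes[j]
        density[(i, j)] = counts.get((i, j), 0) / possible if possible else 0.0
    return density


def mean_intra_and_inter(density):
    """Average density on the diagonal (intra) and off the diagonal (inter)."""
    intra = [d for (i, j), d in density.items() if i == j]
    inter = [d for (i, j), d in density.items() if i != j]
    mean = lambda values: sum(values) / len(values) if values else 0.0
    return mean(intra), mean(inter)


# Toy data: six made-up genes in two clusters, four made-up interactions.
toy_clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 1}
toy_interactions = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g3", "g4")]

toy_density = interaction_density(toy_clusters, toy_interactions)
toy_intra, toy_inter = mean_intra_and_inter(toy_density)
print(toy_density)
print("intra/inter ratio:", toy_intra / toy_inter)
```

With 30 clusters, the keys `(i, j)` with `i <= j` correspond to the squares of the triangular matrix described above, and the diagonal keys `(i, i)` hold the intracluster densities. Averaging the per-square densities is only one way to summarize them; pooling counts across squares would be an equally reasonable choice.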
In a sense, this is perfectly rational, even obvious. "If you imagine a complex of proteins ... you would hope that the genes encoding the subunits of this complex are coregulated," says Vidal. From a teleological point of view, this is in the cell's best interest. As Church points out, "if you have two gene products that work as a dimer, and one of them is expressed and the other one isn't, the other one is like 'idle hands in the devil's workshop.' It can go off and stick to things." So when Ge began this study, other researchers offered some good-natured ribbing. Vidal says, "When we were working on this, our [lab] neighbors kept telling us, 'yeah, right, so, of course.'" But, he adds, "It's very nice to have proof."
Vidal's biological atlas metaphor extends beyond interactome and expression data, to include biochemical genomics, structural genomics, gene knockouts, and protein localization data.2 Recently, his group published another article demonstrating the correlation of such datasets, in which they overlaid interactome data with large-scale phenotypic analyses, resulting in the identification of several new DNA damage response genes in Caenorhabditis elegans.3
1. H. Ge et al., "Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae," Nature Genetics, 29:482-6, December 2001.
2. M. Vidal, "A biological atlas of functional maps," Cell, 104:333-9, Feb. 9, 2001.
3. S.J. Boulton et al., "Combined functional genomic maps of the C. elegans DNA damage response," Science, 295:127-31, Jan. 4, 2002.