Oncomine and caBIG Advance Cancer Bioinformatics

By Arul Chinnaiyan (arul@med.umich.edu) | April 11, 2005


The Google search engine has revolutionized knowledge dissemination over the Internet. With more than 3,000 queries per second, each sifting through hundreds of gigabytes of data, Google has blasted a new path to learning. In the emerging "-omics" era, biomedical research needs a similar but more intelligent integration of knowledge. Can the impressive scale and functionality of the Google approach be harnessed to navigate molecular profiles of cancer?

Scientists can already explore a substantial component of the expressed genome and proteome, but data are accumulating to the point of overload, like trying to drink from a fire hose. What's more, the information is heterogeneous on several levels: in how it is generated, represented, and stored. In this context, bioinformatics (the convergence of biology, information science, and computation) will play a critical role in the future of cancer biology and pathology-based research.

The growing importance of this area has given birth to the National Cancer Institute's cancer Biomedical Informatics Grid (caBIG) pilot initiative (http://cabig.nci.nih.gov), which is attempting to build a unified bioinformatics infrastructure for cancer centers across the country, and possibly the world. Eventually caBIG will allow clinical researchers to share information about patients eligible for clinical trials and biospecimens available for large multi-institutional translational studies. Researchers will also be able to share bioinformatics tools that will be compatible across institutional information technology systems. My lab is doing its part: along with others, we are consolidating data and building tools that, like Google, help biologists mine the existing data without having to worry about where the data come from.



Arul M. Chinnaiyan, MD

DNA microarray technology has revolutionized the way fundamental biological questions are addressed in the postgenomic era. With the sequencing of the human genome, investigators can monitor the expression of nearly all expressed human genes (and alternative splice variants) on a single DNA "chip." Rather than focusing on one gene at a time, these genome-scale methodologies provide a global perspective. Comparative analysis of genome-wide mRNA expression patterns can uncover sets of genes important in specific phenotypes or clinical behaviors. In principle, large-scale gene-expression profiles of tumors can also identify subsets of genes that function as prognostic disease markers or predictors of therapeutic response.

Increasing access to microarray technology has meant that cancers of almost every type have been molecularly profiled. As of December 2004, more than 300 studies using gene-expression microarrays had been published, representing more than 10,000 human tumors and 200 million data points. But these data are not found in any sort of unified, easily mined databank. Nor are results from individual experiments easily compared.

Individual investigators often use different experimental platforms and apply distinct analysis techniques to interpret their data. Despite common standards, data are often dispersed across different Web sites or must be obtained through direct communication with corresponding authors.

As a result, the average biologist, who may lack bioinformatics expertise, cannot easily plumb the wealth of expression data found in the public domain to improve patient care or our understanding of cancer mechanisms. Aware that this experimental treasure trove exists, cancer biologists may yet lack the computational facilities and tools to use it.

In an effort to make human tumor gene-expression data more accessible to the academic community, my laboratory has created a cancer microarray compendium and data-mining platform called Oncomine (http://www.oncomine.org). Evolving from the thesis project of Dan Rhodes, a graduate student in my lab, Oncomine now involves a team of programmers, database administrators, and bioinformaticians.

The database's latest version (3.0) contains 105 datasets, more than 7,000 microarray experiments (each representing an individual human tumor), and in excess of 13 million statistical tests. A biologist can come to the Web site and ask basic questions such as: In what cancers or cancer subtypes is my gene dysregulated? Which are the genes that distinguish metastatic cancer from clinically localized disease? What genes may serve as biomarkers for a particular cancer or cancer subtype?
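To make the first of those questions concrete, here is a minimal sketch of the kind of statistical test that could sit behind it: a Welch's t-statistic comparing one gene's expression in tumor versus normal samples, computed per dataset and then ranked. This is an illustration only, not a description of how Oncomine is implemented. The dataset names, the expression values, and the function names are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(tumor, normal):
    """Welch's t-statistic for tumor vs. normal expression values."""
    # The standard error combines the two per-group sample variances.
    se = sqrt(variance(tumor) / len(tumor) + variance(normal) / len(normal))
    return (mean(tumor) - mean(normal)) / se


def rank_by_dysregulation(datasets):
    """Order datasets by how strongly the gene differs between tumor and normal."""
    scores = {
        name: welch_t(groups["tumor"], groups["normal"])
        for name, groups in datasets.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical log2 expression values for a single gene in two datasets.
datasets = {
    "prostate": {"tumor": [8.1, 8.4, 8.9, 8.6], "normal": [6.9, 7.2, 7.0, 7.1]},
    "breast": {"tumor": [7.0, 7.3, 6.8, 7.1], "normal": [7.1, 6.9, 7.2, 7.0]},
}

for name, t in rank_by_dysregulation(datasets):
    print(f"{name}: t = {t:.2f}")
```

A positive statistic indicates higher expression in tumors, and a negative one lower expression. A real analysis would also convert each statistic to a p-value and correct for the thousands of genes tested at once.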

Thus, much as Google provides a gateway to the Internet, Oncomine provides a window into the cancer transcriptome. That window could open even wider if, as we hope, Oncomine and its related tools become incorporated into the larger caBIG infrastructure. Yet even at its early stage of development, the database already averages 75 to 100 users per day from a total registered user base of more than 3,500 worldwide.


Data compendia such as Oncomine have several potential uses. Investigators, for example, can perform meta-analyses between datasets in order to validate candidate biomarkers or common molecular features of cancer [1]. Likewise, expression patterns correlating with particular clinical outcomes can be validated among independent datasets. This practice is becoming more common as a way to demonstrate the robustness of molecular signatures. Also, hypotheses or molecular signatures derived experimentally might potentially be validated in silico.
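One simple way to picture a meta-analysis across datasets is to combine the p-values that independent studies report for the same gene. The sketch below uses Fisher's method, a standard textbook technique chosen here purely for illustration; it is not necessarily the procedure used in the work cited above. The p-values in the example are hypothetical, and the studies are assumed to be independent.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p), which follows a chi-square
    # distribution with 2k degrees of freedom under the null hypothesis.
    half_x = -sum(log(p) for p in p_values)
    # For even degrees of freedom the chi-square tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Hypothetical p-values for one gene from three independent studies.
print(fisher_combined_p([0.04, 0.20, 0.03]))
```

Evidence that is only modest in each study can become more convincing when the studies agree, which is the practical appeal of validating a candidate marker across independent datasets.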

While scientists are rapidly becoming proficient at profiling gene expression in tumors, the next major hurdle is to elicit biology from these global screens. Next-generation analytical approaches are revealing functional modules and enriched pathways of genes [2]. For instance, instead of merely detecting overexpression of the receptor tyrosine kinase ErbB2, it is possible to detect activation of this pathway via differential expression of downstream genes. This will allow for therapeutic targeting of pathways at multiple levels rather than single targets.
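A common way to ask whether a pathway is "enriched" among differentially expressed genes is a hypergeometric test: given how many genes were measured and how many changed, how surprising is the overlap with the pathway's known downstream genes? The following is a minimal sketch under that assumption, not the specific module analysis cited above. The gene identifiers are placeholders, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def pathway_enrichment_p(universe, pathway, changed):
    """Hypergeometric p-value for the overlap between a pathway and changed genes."""
    universe = set(universe)
    # Only genes that were actually measured can contribute to the test.
    pathway = set(pathway) & universe
    changed = set(changed) & universe
    n, m, d = len(universe), len(pathway), len(changed)
    overlap = len(pathway & changed)
    # Probability of seeing this overlap or a larger one by chance alone.
    tail = sum(
        comb(m, k) * comb(n - m, d - k)
        for k in range(overlap, min(m, d) + 1)
    )
    return tail / comb(n, d)


# Placeholder identifiers: 20 measured genes, a 5-gene pathway,
# and 5 differentially expressed genes, 4 of which are in the pathway.
measured = [f"g{i}" for i in range(20)]
downstream = ["g0", "g1", "g2", "g3", "g4"]
differential = ["g0", "g1", "g2", "g3", "g10"]

print(pathway_enrichment_p(measured, downstream, differential))
```

A small p-value suggests the pathway is active even when the upstream receptor itself shows no obvious change, which is the idea behind looking at downstream genes rather than a single target.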


Courtesy of Vasudeva Mahavisno and Dan Rhodes

From cells to pathways

Further insight into these pathways will be gained by comparing oncogenic signatures from cell lines and in vivo cancer progression models with the expression patterns obtained from tumor specimens [3]. Mapping pathway signatures in vitro and in vivo may shed light on neoplastic transformation in humans.

Integrative bioinformatics analyses like those carried out by Segal et al., our group, and others [1,2,4] will generate novel hypotheses with respect to cancer progression. For example, by carefully maintaining clinical annotations of the specific tissue specimens analyzed, we were able to identify gene alterations that were common to cancer regardless of tissue of origin, as well as gene signatures characteristic of more aggressive dedifferentiated cancers [1]. Similarly, by integrating diverse in vitro and in vivo gene-expression information, Segal et al. identified an osteoblastic module active in selected solid tumors [2].

Characterizing cancer gene-expression patterns from a systems perspective will also involve understanding the protein-protein networks and transcriptional regulatory programs at work behind the scenes. Protein-protein interaction datasets obtained from model organism yeast two-hybrid studies, interaction databases such as the Human Protein Reference Database (http://www.hprd.org), and computational prediction will be useful in reconstructing coordinately regulated protein complexes. Thus, we potentially will be able to mine the cancer transcriptome to predict interacting complexes in tumor progression. Oncomine can provide these types of data.

Likewise, by linking genes to their transcriptional regulatory elements and modeling these relationships, we may begin to make predictions about transcription factors and regulatory proteins responsible for the gene-expression alterations observed. By integrating knowledge of transcription factor binding from the TransFac database and gene-promoter sequences, for instance, one may be able to predict the regulatory programs mediating a particular gene-expression signature.


One future goal is to seamlessly integrate high-throughput molecular data with clinical informatics systems and conventional morphology-based pathology. This will benefit clinical researchers and have immense value for scientists involved in basic translational research. Consider, for example, the cancer biomarker field. It is already relatively common to use DNA microarrays to profile human tumor specimens for identifying candidate markers, which are then validated on tissue microarrays bearing hundreds of clinically stratified patient specimens [5,6].

The molecular realm itself has become quite complicated, and can encompass alterations in DNA, mRNA transcripts, proteins, and metabolites, among other molecular constituents. Integrative analyses and modeling of 'omic molecular data with more standard clinical and pathology parameters will be important to the future of cancer-related research, and translational research in general. Models that take into account disparate molecular and clinical data and tumor morphology could influence patient care by predicting prognosis and tailoring therapy. The caBIG initiative will play a major role in coordinating this integration effort across cancer centers.

Perhaps in the near future we will be able to use molecular signatures or panels of multiplexed biomarkers to identify whether a tumor is malignant or benign, its site of origin (if metastatic), its prognostic subtype, and even predict its response to therapy. Integrative analyses of gene-drug pairs from therapeutic drug databases with gene-expression signatures may lead to personalized multidrug regimens based on an individual's tumor gene-expression signature.

Clearly, high-throughput molecular techniques and bioinformatics will have a central role in the future of pathology and oncology in both the clinical and research realms. We now have the opportunity to develop systems that integrate, warehouse, analyze, and model disparate data elements for both investigative and clinical decision-making. Google would be proud.

Arul M. Chinnaiyan, MD, PhD, is the S.P. Hicks Collegiate Professor of pathology and associate professor of urology at the University of Michigan Medical School. He is also director of the pathology microarray lab and the cancer bioinformatics core. He has coauthored more than 75 publications and established the Oncomine cancer microarray project.

He can be contacted at arul@med.umich.edu.
