Bioinformatics, Genomics, and Proteomics

Nov 27, 2000
Christopher Smith

Data Mining Software for Genomics, Proteomics and Expression Data (Part 1)

Data Mining Software for Genomics, Proteomics and Expression Data (Part 2)

High-throughput (HT) sequencing, microarray screening, and protein expression profiling technologies drive discovery efforts in today's genomics and proteomics laboratories. These tools allow researchers to generate massive amounts of data, at a rate orders of magnitude greater than scientists ever anticipated. Initiatives to sequence entire genomes have resulted in single data sets ranging in size from 1.8 million nucleotides (Haemophilus influenzae genome) to more than 3 billion (human genome). A single microarray assay can easily produce information on thousands of genes, and a temporal protein expression profile may capture a data picture of 6,000 proteins.1

[Figure: Integration of Genomica's LinkMapper with ABI's Gene Mapper]

It's what you do with the data that counts, however, and that's where bioinformatics takes over. Researchers in bioinformatics are dedicated to the development of applications that can store, compare, and analyze the voluminous quantities of data generated by the use of new technologies.

One of the original functions of bioinformatics was to provide a mechanism to compare a query DNA or protein sequence against all sequences in a database. Several comparison algorithms, such as Smith-Waterman, FASTA, and BLAST, have yielded successful and powerful computational applications.2 Early on, queries were relatively small: individual query sequences ranged from a few to 10,000 nucleotides, and query sets held 10 to 1,000 sequences. Because of the proliferation and improvement of HT sequencing technologies, it is now common to find query sequences with 10,000 nucleotides and data sets containing up to 1 million sequences.
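To make the idea of sequence comparison concrete, here is a minimal sketch of Smith-Waterman local alignment in Python. The scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions, not parameters taken from any particular tool, and production programs use substitution matrices and far more optimized code.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned query, aligned target) for the best local alignment.

    The scoring scheme is a simple assumed one: fixed match/mismatch scores
    and a linear gap penalty.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning query[i-1] with target[j-1].
        return match if query[i - 1] == target[j - 1] else mismatch

    # Fill the dynamic-programming matrix; a cell never drops below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in the target
                score[i][j - 1] + gap,            # gap in the query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


if __name__ == "__main__":
    print(smith_waterman("ACGT", "TTACGTTT"))
```

The matrix has one cell per pair of residues, so time and memory grow with the product of the two sequence lengths. That cost is one reason heuristic tools such as FASTA and BLAST are preferred for searching large databases.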

The kinds of data developed and the methods for processing and analysis also have changed. Previously, small-scale DNA sequencing projects would perhaps generate 100 sequences (usually 50-400 nucleotides) that could be assembled relatively easily into a contiguous DNA sequence (a contig). Today, contig assembly may involve 1 million sequences with up to 5,000 nucleotides. The burgeoning fields of proteomics and microarray technologies provide another degree of complexity, adding multidimensional information to the biological data cornucopia.
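As a rough illustration of what contig assembly involves, the sketch below greedily merges the pair of reads with the longest suffix-to-prefix overlap until no overlaps remain. It assumes error-free reads from a single strand and an arbitrary minimum overlap of three bases; real assemblers must handle sequencing errors, repeats, and both strands, and they use far more efficient data structures.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs by repeatedly joining the best-overlapping pair."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining pair overlaps; the reads left are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The all-against-all comparison is manageable for a hundred reads but not for a million, which helps explain why the growth in project size described above demands new algorithms rather than just faster hardware.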


New Scientific Challenges

The exponential rate of discovery in the era of modern molecular biology has been nothing short of phenomenal, culminating with the announcement in June 2000 that preliminary sequencing of the human genome had been completed.3 However, this achievement is just a taste of the scientific successes that are to come in the 21st century. As impressive as it is, the determination of the sequence of the approximately 3.2 billion nucleotides of the human genome, encoding an estimated 100,000 proteins, represents only the first step down a long road. Gene identification does not automatically translate into an understanding of gene function. Although mapping and cloning studies have linked a number of genes to heritable genetic diseases, the true (i.e., "normal") function of a majority of these genes remains unknown.

This dichotomy between gene identity and function will be one source of new research challenges in the 21st century, encompassing problems in biological science, computational biology, and computer science. Biologists will need to decipher the genetic makeup of genomes, map genotypes with phenotypic traits, determine gene and protein structure and function, design and develop therapeutic agents (recombinant and genetically engineered proteins, and small molecule ligands), and unravel biochemical pathways and cellular physiology. Tackling these biological issues will require innovations in computational biology that will be met by the development of new algorithms and methods for comparison of DNA and protein sequence, design of novel metrics for similarity and homology analyses, tools to outline biochemical pathways and interactions, and construction of physiological models. Success in the computational biology arena will require improvements in computational and informatics infrastructure, including development of novel databases as well as annotation, curation, and dissemination tools for the databases; design of parallel computation methods; and development of supercomputers. These latter challenges are particularly important, as high performance computing (HPC) and bioinformatics applications need to be retooled to accommodate the fast interrogation of a plethora of databases, comparisons between relatively long strings of data, and data with varying degrees of complexity and annotation.

The lion's share of interest and effort over the past few years has been directed toward protein identification (proteomics), structure-function characterization (structural bioinformatics), and bioinformatics database mining. The pharmaceutical industry has for the most part driven these efforts in the search for new therapeutic agents. Identifying proteins from the cellular pool and/or determining structure-function in the absence of concrete biological data is a daunting task, but novel technological approaches are helping scientists to make headway on these fronts.


Proteomics: Protein Expression Profiling

Proteomics refers to the science and the process of analyzing and cataloging all the proteins encoded by a genome (a proteome). Since the majority of all known and predicted proteins have no known cellular function, the hope is that proteomics will bridge the chasm separating what raw DNA and protein primary sequence reveals about a protein and its cellular function. Determining protein function on a genomewide scale can provide critical pieces to the metabolic puzzle of cells. Because proteins are involved, to one degree or another, in disease states (whether induced by bacterial or viral infection, stress, or genetic anomaly), complete descriptions of proteins, including sequence, structure, and function, will substantially aid the current pharmaceutical approach to therapeutics development. This process, known as rational drug design, involves the use of specific structural and functional aspects of a protein to design better proteins or small molecule ligands that can serve as activators or inhibitors of protein function. A recent technology profile in LabConsumer4 and a meeting review5 detail companies providing proteomics tools.

The multidimensional nature of proteomics data (for example, 2D-PAGE gel images) presents novel collection, normalization, and analysis challenges. Data collection issues are being overcome by sophisticated proteomic systems that semiautomate and integrate the experimental process with data collection. Improvements in the experimental technology have increased the number of proteins that can be identified, with consistency, within a single gel; however, making comparisons and looking for patterns and relationships between proteins and/or particular environmental, disease, or developmental states requires data mining and knowledge discovery tools.

Finding the Needle in the Haystack

Data mining refers to a new genre of bioinformatics tools used to sift through the mass of raw data, finding and extracting relevant information and developing relationships between them.6 As advances in instrumentation and experimental techniques have led to the accumulation of massive amounts of data, data mining applications are providing the tools to harvest the fruit of these labors. Maximally useful data mining applications should:

* process data from disparate experimental techniques and technologies and data that has both temporal (time studies) and spatial (organism, organ, cell type, sub-cellular location) dimensions;

* be capable of identifying and interpreting outlying data;

* use data analysis in an iterative process, applying gained knowledge to constantly examine and reexamine data; and

* use novel comparison techniques that extend beyond the standard Bayesian (similarity search) methods.

Data mining applications are built on complex algorithms that derive explanatory and predictive models from large sets of complex data by identifying patterns in data and developing probable relationships. Data mining workbenches also incorporate mechanisms to filter, standardize/normalize, cluster data, and visualize results.
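The filter, normalize, and cluster steps can be sketched in a few lines of Python. This is a toy illustration, not the method of any product discussed here: the gene names, thresholds, and the simple "leader" clustering rule (a gene joins the first cluster whose founding profile it correlates with strongly) are all assumptions chosen for brevity.

```python
from statistics import mean, pstdev


def zscore(profile):
    """Rescale a profile to mean 0 and standard deviation 1."""
    mu, sigma = mean(profile), pstdev(profile)
    return [(x - mu) / sigma for x in profile]


def correlation(z_a, z_b):
    """Pearson correlation of two profiles that are already z-scored."""
    return sum(x * y for x, y in zip(z_a, z_b)) / len(z_a)


def mine_profiles(expression, min_sd=0.5, min_corr=0.9):
    """Filter flat profiles, normalize the rest, and group similar ones.

    expression maps a gene name to a list of measurements (one per time
    point or condition). min_sd must be positive so that no constant
    profile reaches the z-score step.
    """
    # Filter: discard genes whose expression barely changes.
    kept = {gene: p for gene, p in expression.items() if pstdev(p) >= min_sd}
    # Normalize: compare the shape of each profile, not its magnitude.
    normalized = {gene: zscore(p) for gene, p in kept.items()}
    # Cluster: assign each gene to the first cluster whose leader it matches.
    clusters = []  # list of (leader profile, member gene names)
    for gene, profile in normalized.items():
        for leader, members in clusters:
            if correlation(leader, profile) >= min_corr:
                members.append(gene)
                break
        else:
            clusters.append((profile, [gene]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    data = {
        "geneA": [1.0, 2.0, 4.0, 8.0],      # rising
        "geneB": [10.0, 20.0, 40.0, 80.0],  # rising, ten times the magnitude
        "geneC": [8.0, 4.0, 2.0, 1.0],      # falling
        "geneD": [5.0, 5.1, 5.0, 4.9],      # essentially flat
    }
    print(mine_profiles(data))
```

Here geneA and geneB land in the same cluster despite a tenfold difference in magnitude, because normalization leaves only the shape of the profile; geneD is filtered out before clustering.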

As a tool to identify open reading frames (ORFs) or hypothetical genes in genomic data, data mining is a new twist on existing gene discovery applications, such as programs that identify intron/exon boundaries in genomic DNA. One of data mining's greatest practical applications will be in the area of HT, microarray-based gene- and protein-expression profiling, where massive data sets need to be examined to identify sometimes subtle intrinsic patterns and relationships. Differential gene analysis has the potential to explicitly describe the interrelationships of genes during development, under physiological stress, and during pathogenesis. The data mining approach taken to analyze microarray data is a function of experimental design and purpose. Investigations analyzing defined perturbations of a given genetic stasis use hypothesis-testing computational methods, whereas genetic surveys and research into fundamental cellular biology use statistical methods. Similarly, the same methods are utilized in analyzing large-scale proteomics data sets.
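The ORF-finding task mentioned above can be sketched as a scan of all six reading frames for a start codon followed by an in-frame stop codon. The sketch assumes the standard genetic code's start (ATG) and stop codons and an arbitrary minimum length; it ignores introns, so it is closer to what works for bacterial genomes than to a full gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) for each ORF found.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon, inclusive. min_codons counts codons before the stop. For the
    "-" strand, start and end are positions on the reverse-complemented
    sequence, not on the input.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, s[start:pos + 3]))
                    start = None  # close it and look for the next ATG
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)
```

A candidate that reaches the end of the sequence without a stop codon is not reported, which keeps the example simple but would miss genes truncated at the ends of a contig.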

An extension of data mining is the concept of knowledge discovery (KD), in which the results of data mining experiments open up new avenues of research,7 with obvious and subtle findings forming the basis of new questions from different perspectives. Some of the more prominent data mining applications and KD workbenches are described in the accompanying table.


Predicting Protein Structure and Function

Structural bioinformatics involves the process of determining a protein's three-dimensional structure using comparative primary sequence alignment, secondary and tertiary structure prediction methods, homology modeling, and crystallographic diffraction pattern analyses. Currently, there is no reliable de novo predictive method for protein 3D-structure determination. Over the past half-century, protein structure has been determined by purifying a protein, crystallizing it, then bombarding it with X-rays. The X-ray diffraction pattern from the bombardment is recorded electronically and analyzed using software that creates a rough draft of the 3D structure. Biological scientists and crystallographers then tweak and manipulate the rough draft considerably. The resulting spatial coordinate file can be examined using modeling-structure software to study the gross and subtle features of the protein's structure.

One major bottleneck associated with this classic crystallography technology is the inordinate amount of time it takes to successfully grow protein crystals. This problem is being addressed by HT technology under development that streamlines the crystallization process. This HT crystallography technology performs many crystallization conditions in parallel with real-time photo-video crystal monitoring. This enables the researcher to test thousands of crystallization conditions simultaneously, aborting those conditions that do not work at an early stage and selecting "perfect" crystals suitable for X-ray analysis.

Efforts to bypass the excessive time needed to tweak the rough draft of X-ray crystallographic structures have led to the advancement of computational modeling (homology and ab initio modeling) approaches. These techniques have been under development, in one form or another, since the first protein structure (of myoglobin) was determined in the late 1950s.8 Computational modeling utilizes predictive and comparative methods to fashion a new protein structure. Ab initio methods use the physicochemical properties of the amino acid sequence of a protein to literally calculate a 3D structure (lowest energy model) based on protein folding. Rather than determining the structure of an entire protein, ab initio methods are typically used to predict and model protein folds (domains). This method is gaining ground considerably, in part due to the development of novel mathematical approaches, a boost in available computational resources (for example, tera- and petaFLOPS supercomputers), and considerable interest from researchers investigating protein-ligand (or drug) interactions. Having the structure, even if only hypothetical, for the part of a protein that interacts with a ligand can potentially hasten drug exploration research.

In homology modeling, the structural and functional characteristics of known proteins are used as a template to create a hypothesized structure for an "unknown" protein with similar functional and structural features. Protein structure researchers estimate that 10,000 protein structures will provide enough data to define most, if not all, of the approximately 1,000 to 5,000 different folds that a protein can assume;9 hence, predictive structure modeling will become more accurate and important as more and more structures are derived. The homology modeling approach has become very important to the pharmaceutical industry, where expense and time are major drawbacks to the classical methods of determining protein structure, even if automation shortens the discovery cycle. Hypothesized models provide an electronic footprint with which researchers may computationally design various "shoes," such as inhibitors, activators, and ligands.10 This provides for better engineering of potential drugs and reduces the number of compounds that need to be tested in vitro and in vivo.

A variety of companies and research initiatives have undertaken these modern approaches to 3D protein structure determination. Most produce structure prediction/modeling applications useful in drug development and basic science research, provide access to proprietary structure databases, and/or will develop customized analysis services for researchers. LabConsumer will present a profile on molecular modeling applications, including those that are key players in homology modeling, early next year.


Tools for the 21st Century

Modern experimental technologies are providing seemingly endless opportunities to generate massive amounts of sequence, expression, and functional data. The drive to capitalize on this enormous pool of information in order to understand fundamental biological phenomena and develop novel therapeutics is pushing the development of new computational tools to capture, organize, categorize, analyze, mine, retrieve, and share data and results. Most current computational applications will suffice for analyses of specific questions using relatively small data sets. But to expand scientific horizons, to accommodate the larger and larger data sets, and to find patterns and see relationships that span temporal and spatial scales, new tools that broaden the scope and complexity of the analyses are needed. Many of these data mining tools are available from the companies highlighted in the accompanying table. These new products and those listed in a previous LabConsumer profile11 have the capacity to expand research opportunities immeasurably.

Christopher M. Smith is a freelance science writer in San Diego.



1. W.P. Blackstock, M.P. Weir, "Proteomics: quantitative and physical mapping of cellular proteins," Trends in Biotechnology, 17:121-7, 1999.

2. R.F. Doolittle, "Computer methods for macromolecular sequence analysis," Methods in Enzymology, Vol. 206. San Diego, Academic Press, 1996.

3. A. Emmett, "The Human Genome," The Scientist, 14[15]:1, July 24, 2000.

4. L. De Francesco, "One step beyond: Going beyond genomics with proteomics and two-dimensional technology," The Scientist, 13[1]:16, January 4, 1999.

5. S. Borman, "Proteomics: Taking over where genomics leaves off," Chemical & Engineering News, 78[31]:31-7, July 31, 2000.

6. J.L. Houle et al., "Database mining in the human genome initiative," Amita Corp., 2000.

7. G. Zweiger, "Knowledge discovery in gene-expression-microarray data: mining information output of the genome," Trends in Biotechnology, 17:429-36, 1999.

8. J.C. Kendrew et al., "Structure of myoglobin," Nature, 185:422-7, 1960.

9. L. Holm, C. Sander, "Mapping the protein universe," Science, 273:595-602, 1996.

10. J. Skolnick, J.S. Fetrow, "From genes to protein structure and function: Novel applications of computational approaches in the genomics era," Trends in Biotechnology, 18:34-9, 2000.

11. C. Smith, "Computational gold: Data mining and bioinformatics software for the next millennium," The Scientist, 13[9]:21-3, April 26, 1999.

12. R.H. Gross, "CMS molecular biology resource," Biotech Software & Internet Journal, 1:5-9, 2000.

Bioinformatics on the Web

Portals to data analysis

The heart of bioinformatics analyses is the software and the databases upon which many of the analyses are based. Traditionally, bioinformatics software has required high-end workstations (desktop to mid-range servers) with a multitude of visualization plug-ins and/or peripheral equipment, and a user (or administrator) willing to routinely download database updates. The mid-range UNIX server is still the standard bioinformatics platform, though there are also a fair number of Microsoft Windows and Apple PowerMac computers. There are also a number of specialized platforms that integrate hardware and custom software into a powerful data analysis tool, such as DeCypher, produced by TimeLogic of Incline Village, Nev.; Bioccelerator, from Compugen Ltd. of Tel Aviv, Israel; and GeneMatcher, manufactured by Paracel Inc. of Pasadena, Calif. Yet the amount of time, money, and effort needed to purchase and maintain the hardware, software, and databases required for bioinformatics research can be a considerable burden to a research laboratory.

[Figure: 2D-gel analysis with Compugen's ...]

To circumvent many of these problems, a few commercial entities are now providing fee-based bioinformatics analysis services through the World Wide Web. These services offer several advantages over local stand-alone or server-based analyses. Because they are provided through a Web interface, these services are platform-independent and may be accessed by practically any Web browser. They are also accessible from anywhere in the world. No longer must researchers struggle with different applications that perform the same function, different computer systems, file formats, and other hurdles to access their data and results. Bioinformatics Web portals truly provide universal access. Some of the more recent application service providers of Web-based bioinformatics tools are presented below.

Bionavigator is a product of eBioinformatics Inc., of Sunnyvale, Calif., a spin-off venture of the Australian National Genomic Information Service. This service primarily targets academic researchers and provides access to more than 20 databases and 200 analytical tools, including those for database searching, DNA/protein sequence analysis, phylogenetic analyses, and molecular modeling. Another attractive and useful feature of Bionavigator is that it can generate publication-quality result output (for example, color-coded multiple sequence alignments and graphic phylogenetic trees).

DoubleTwist, formerly Pangea Systems of Oakland, Calif., is a major purveyor of annotated sequence data through its Prophecy database. DoubleTwist has recently added fee-based bioinformatics services through an integrated life science portal. Using any one of a number of "research agents," researchers can analyze protein and DNA sequence data. DNA analysis tools provide for the identification of new gene family members, potential full-length cDNAs, and sequence homologs, whereas the protein tools include routines to identify protein family associations, protein-protein interactions, and conserved protein domains.

GeneSolutions, a product of HySeq Inc., of Sunnyvale, Calif., provides access to information describing proprietary gene sequences and related data from more than 1.4 million expressed sequence tags (ESTs) analyzed by HySeq using its proprietary SBH process. The GeneSolutions Portfolio contains gene sequences, homology data, and gene expression data generated by HySeq. More than 35,000 genes are reported to have been identified and characterized in HySeq's proprietary databases.

IncyteGenomics OnLine Research provides a Web portal to the numerous databases developed and maintained by Incyte Genomics Inc., of Palo Alto, Calif., and a personal workbench where researchers can store their sequences, perform analyses, and search the company's databases.

Another portal, developed by Compugen Ltd., is an Internet life science research engine providing access to a variety of gene discovery tools. First introduced in December 1999, the portal reached version 2.0 in September 2000. This latest release includes a variety of tools for the prediction of open reading frames and polypeptides (including an InstantRACE module that uses public and proprietary databases to return a complete cDNA sequence given an input EST), alternative splicing sites, gene function (by similarity to protein domain profiles), and tissue distribution, among others.

Compugen also provides a service for the analysis of 2D-gel image data using Z3 software. Researchers have the option of purchasing the software and running it on their own workstations, or they may upload their image data to the Web-accessible Z3 platform for analysis.

For researchers working on a nonexistent bioinformatics budget, there is still a host of powerful bioinformatics applications accessible without charge on the Web. If the researcher needs to perform only one or two types of analyses, and if data security, having to work through several disparate applications, and output format are not critical issues, then these gratis Web tools are a bargain.

A comprehensive listing of more than 2,300 Web-based bioinformatics tools (and information sources), organized according to the type of analyses they perform, is available through the CMS Molecular Biology Resource12 at the San Diego Supercomputer Center, University of California. A good place to start is the National Institutes of Health's National Center for Biotechnology Information (NCBI) Web site. This server contains sequencing and mapping data for nearly 800 different organisms through the GenBank database, all searchable using the BLAST tool. NCBI also offers an ORF finder, the Online Mendelian Inheritance in Man (OMIM) database of human genes, and a variety of other useful tools, most of them cross-indexed to the NCBI PubMed MEDLINE database.

--Christopher M. Smith