The State of Bioinformatics

Arielle Emmett
Nov 26, 2000

Disentangling the good from the bad--gene and protein data, that is--may be the toughest task for today's bioinformatics scientists assembling new models for proteins, says Greg Paris, executive director of biomolecular structure and computing at Novartis Pharma Research in Summit, N.J. "One of the major advances has been the speed with which new genomes can be characterized and at least partially annotated, and there are very good gene-finding tools that help in this endeavor," he explains. "The minus is that the annotation process strongly relies on prior annotation, so that even with the most careful attention, it's possible to propagate low-probability guesses as though they were high-probability facts. This means that downstream, disentangling the quality of the evidence [from the quantities] is quite problematic. From a pharmaceutical perspective, a double-edged positive/negative is the vast quantity of data that is now available. It presents extreme challenges for data mining."

In the wake of last June's near completion of the Human Genome Project,1 some scientists are now calling for new agendas and tools for next-phase bioinformatics--the science of understanding the structure and function of genes and proteins through advanced, computer-aided statistical analysis and pattern discovery. Wade Rogers, senior research associate for the Corporate Center of Engineering Research, DuPont Corp., puts it this way: "The human genome is an enormously rich source of fundamental data, but the data's availability is both a blessing and curse. The more data we have to work with, [the more challenging it is] to find fundamental nuggets of useful information buried in that data." He is among those who argue that software, processing, and computational algorithms more powerful than today's will be needed to decipher the functional interactions of multiple genes and proteins--the driving agenda of structural genomics in the near future. "Even though you can never hope to find all the patterns, we need more efficient algorithms that scale polynomially--clever algorithms that enable us to solve problems that are otherwise not solvable," Rogers continues.

Bioinformatics scientists already count several achievements in recent years enabling them to identify the function of several thousand genes. Among these are more efficient methods of searching base pair sequences for matching patterns; the development of statistical tools; and assembly of the whole genome using novel computational techniques (See also, "Bioinformatics, Genomics, and Proteomics," page 26). The new approaches to bioinformatics, including the Rosetta Stone and phylogenetic profile methods, will "surpass the traditional method of sequence homology, which seeks correlations between amino acid sequences," suggests David Eisenberg, professor of molecular biology and director of the University of California, Los Angeles-Department of Energy laboratory of structural biology and molecular medicine.

Instead of trying to identify discrete genes and proteins by a few defined molecular functions, "we want to know the interactions among all the proteins of a cell, and even beyond that, all the molecules of a cell," Eisenberg states. This goal would include interactions of "small molecules with large molecules, and large molecules with each other; how proteins interact with DNA, RNA, and other proteins; and, going to the next level, how the messengers from one cell interact with other cells. And those messengers can include small molecules, metabolites, hormones, and large molecules--proteins going from one cell to another." "I see the future of informatics as describing all these hierarchies of interactions," Eisenberg says, arguing that scientists will analyze genomic and expression data for "networks of functional linkages between proteins in cells," which in turn will redefine the classical notions about protein function.

"The classic function of a protein was that a protein catalyzes or carries out the specific reaction in a cell; either converting substrate S to product B or binding to a small molecule X," Eisenberg says. "But in the 'post-genomic' view, the function of a protein is in the context of its interaction with all the molecules in the cell .... No longer do we think of a protein as having one job, but rather it's defined by having all these different functions, interactions, and connections. The work we've done already suggests that even in simple cells like yeast, each protein interacts on the average with five to 10 proteins minimum."


New Protein Structure Initiative

Eisenberg's lab, which focuses on protein folding research, will be in the vanguard of a high-throughput protein structure initiative announced in September by the National Institute of General Medical Sciences. NIGMS has set a goal among seven institutions to determine the structures of thousands of proteins over the next decade, with a $150 million award for projects carried out in an initial five-year pilot stage. This could have a direct impact on bioinformatics technology. "This project can be viewed as an inventory of all the protein structure families that exist in nature," says Marvin Cassman, NIGMS director.

"This really amounts to making a family classification--putting proteins into families, sequences, and structures, and [using] genomes and homology modeling for small structures as a way to determine a large number of structures--proteins with similar sequences," comments John Norvell, director of the NIGMS Protein Structure Initiative. In short, it's achieving a shortcut in coverage of protein structures. "[We have a goal of] doing 10,000 structures in 10 years," Norvell explains. "The methodology and technology has developed at an incredible rate over the last few years--we have breakthroughs in synchrotrons, NMR [nuclear magnetic resonance] high field magnets [and] better ways of obtaining crystals for expressing and purifying proteins." The logical follow-up phase to the Human Genome Project "is to find out what all these genes do and characterize the proteins in many ways--proteomics; spectroscopy, electrophoresis. But certainly one component is to determine the proteins' atomic shape, structure, and physical shape. That's what we're calling structural genomics; that's the launch of [an] effort to determine protein structures on a larger scale than has been done before. We already have a total of about 13,000 structures in the database, so there may be tens of thousands [of] structures out there," he says.

How bioinformatics ought to answer the problems of function and interaction remains complex. "The field is in a state of flux; things are happening so rapidly that standards haven't been established, and the game keeps changing," states Charles E. Lawrence, chief of the bioinformatics lab, Wadsworth Center, New York State Health Department. Lawrence and his team have developed a novel algorithm, PROBE, for predicting the structure of protein targets and identifying bacterial transcription signals. He and his coauthors2 won the 2000 Mitchell Prize from the International Society of Bayesian Statistics.

"Now that we have the whole genome, the question is what does it do? Where are the genes?" he asks. "You've got to predict what parts of the sequence are actually genes--where are the exons? You've got to get it right to get the proteins right--but the algorithms have only been about 70 percent accurate to predict where the exons are from raw genomics data. My opinion is that's better than nothing, but 70 percent is not too good." Though some of the accuracy and exon identification issues will be addressed when comparative data are available through the human/mouse genome mapping projects, "right now we're in this conundrum--predicting with algorithms that are clearly lacking."

Lawrence argues that even widely used algorithms actually lack sophisticated interfaces accessible to nonspecialists. He believes that the sheer volume of genetics data is forcing a paradigm shift toward new data-intensive methods for extracting knowledge, building more accurate homology models for proteins, as well as predicting how proteins function given their structure. This forces a move away from the traditional laboratory and into the complexities of still more efficient search algorithms and an ever-expanding computer database. "We're moving from the perspective of hypothesis-driven research to a new paradigm that's data-driven and puts as much creative energy in analyzing the data as the old one did in designing the experiment," he observes.

"The biggest angle on [the whole genome] and what it does is when and under what circumstances sets of genes are expressed together, and there's even a bigger statistical analysis problem," Lawrence adds. "That's the part I'm working on. The goal is understanding the mechanism behind the differences in [gene expression]. When you get down to it and look at the sequences of other primates, we're going to find those genes are pretty much all the same. So why are we different from another primate? It's going to be in the differential regulation of the expressions of genes; and our lab's work is focused on using the genome sequences to understand the regulation and the common co-regulation of genes--to comprehend what's responsible for the common patterns coming out of arrays. That will be a very important field. There will be a whole regulation issue for splicing out of introns--a whole other regulatory story to figure out in addition to which genes get expressed."

Problems whose solution time scales exponentially with the problem size N (as 2^N), as the best known algorithms for NP-Hard problems do, are considered intractable: no matter how large your computer, a small increase in the size of the problem moves it beyond the computer's capability. Problems that scale polynomially (as N^3, for example) are tractable, because it is in principle possible to build a sufficiently fast computer to solve them.
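To make the contrast concrete, here is a minimal Python sketch that asks how large a problem a machine could finish in one day under each kind of scaling. The machine speeds (10^9 and 10^12 operations per second), the one-step-per-operation cost model, and the function names are all assumptions chosen for illustration; nothing here is drawn from the researchers quoted in this article.

```python
def exponential_steps(n: int) -> int:
    """Steps needed by a hypothetical algorithm that scales as 2^n."""
    return 2 ** n


def polynomial_steps(n: int) -> int:
    """Steps needed by a hypothetical algorithm that scales as n^3."""
    return n ** 3


def largest_solvable(step_fn, ops_per_second: int, seconds: int) -> int:
    """Largest problem size n whose step count fits in the time budget.

    Assumes one step costs one machine operation and that a problem
    of size 1 always fits in the budget.
    """
    budget = ops_per_second * seconds
    n = 1
    while step_fn(n + 1) <= budget:
        n += 1
    return n


if __name__ == "__main__":
    one_day = 24 * 60 * 60  # seconds
    for speed in (10 ** 9, 10 ** 12):  # assumed machine speeds
        n_exp = largest_solvable(exponential_steps, speed, one_day)
        n_poly = largest_solvable(polynomial_steps, speed, one_day)
        print(f"{speed:.0e} ops/s: 2^n up to n={n_exp}, n^3 up to n={n_poly}")
```

Under these assumptions, a machine a thousand times faster adds only about 10 to the largest exponential problem it can finish in a day, while it multiplies the largest cubic problem by about 10.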

Celera's Next Step

Celera Genomics Group, the private corporation responsible for mapping the human genome and developing the now-famous whole genome shotgun sequencing method,3 now says that discovering expression patterns will become a key element of new informatics work.

"Beyond the systematic exploration of whole genomes, we're systematically exploring [gene] expression patterns manifested by different tissue types and disease states [and] doing a wide systemic study of all the expression patterns that do occur," said Gene Myers, vice president for informatics research at Celera. "We'll explore the levels of protein expression. In RNA, it's a measure of transcript expression [the frequency with which a protein is made]. But [current] expression data is very soft. It's quite inaccurate; and although you get a reliable indication of whether you've got a particular protein, it's much harder to get information on how much of it there is."

Celera will place much of its research priority on "interpreting regulatory signals upstream to determine how they're turned on and off, and what's the genetic variation across human populations and the regulatory mechanisms that cause disease states and variations in phenotypes." Mouse genomics data will aid in the interpretation task--"we've used it extensively to do annotation of the human genome, and now we can go to the next level and predict genes that cannot otherwise be found without comparison to the mouse genome," Myers says.

Celera is aiming at a bioinformatics model it terms the "cyberpharmaceutical paradigm." This integrates genomics and pharmacogenomic data with single nucleotide polymorphisms (SNPs), proteomics, biology, and informatics data to present researchers with more complete drug discovery tools. "This will amount to methods for determining how computationally you can test new agents [to fight] diseases," comments Sorin Istrail, senior director of informatics research. "We plan to offer a wide range of data in an integrated fashion; we'll be generating a tremendous amount of expression and proteomics data; along with SNP information. Our [business] model is to replace big pharma IT [information technology] departments by offering services in a time-sharing way. This would be more cost effective for them to use us for data and services rather than maintaining a separate IT organization." Celera's pitch to subscribers rests on the accuracy of its data, plus the value of new releases and database query tools developed for its services. The company will not reveal details on any new algorithms that may be developed, but it plans to offer annual subscribers "novel orthogonal data sets" as they become available.


Are the Modeling Tools Adequate?

Stephen Bryant, a senior investigator at the National Center for Biotechnology Information (NCBI), says that protein prediction and classification into families and superfamilies remains one of the most time-consuming tasks for bioinformatics researchers. But he is generally satisfied that existing statistical tools--and computers--can get the job done. "What's left is to get lots of [protein] molecules into families, and that's a very hard problem. So we're shifting to a library of conserved domain databases--grappling together the sequence and structure of families and representing the [relationships] between structure and function [through annotation]," he explains. "There's a lot of technology to keep updating databases," but he claims the time taken to decipher and describe [protein families] and common features of families is far more limiting than the database software or algorithms themselves.

Diverse approaches to the field are commonplace, however. At DuPont, computational scientists are working on a class of problems known as NP-Hard (nondeterministic polynomial-time hard), which appear to be relevant to the mining of genomic databases. "It turns out that the problem of [genomic and protein] pattern discovery has been widely thought to be in the NP-Hard class," says Rogers. "That's an interesting class of problems because the calculation time grows exponentially with input. Problems that scale exponentially are called intractable--no matter how big the computer, we won't be able to solve them."

Rogers' team is partnering with DuPont Pharmaceuticals on the analysis of a particular class of proteins called G-Protein Coupled Receptors; GPCRs are molecules found in the membranes of many cells, and are involved in signaling. "They're implicated in an estimated 60 percent of all pharmaceutical interactions, so they're an obvious target for companies developing compounds to treat [many conditions]." Rogers is using his new methods of computational analysis to understand fundamentally how GPCRs work to develop compounds that target those receptors and have fewer deleterious side effects.

Protein folding is another area of intense inquiry--especially because researchers not only have bioinformatics analysis challenges, but also must attempt to use graphical 3D representations of structures based on amino acid sequences. According to Chris Bystroff, an assistant professor at Rensselaer Polytechnic Institute, the two approaches to predicting 3D structures are first-principles approaches, such as molecular dynamics simulations, and pattern recognition approaches, such as sequence alignment and profile analysis. The former demands huge amounts of processing power; the latter attempts to find patterns by "look-up" from a table of known structures.

"We don't have enough computational power to handle molecules as big as proteins [using the first method]," Bystroff says. "So we are doing cryptography to find out what sort of amino acid sequence patterns correspond to 3D structure patterns or motifs. I've worked on a motif library called the I-SITES library, which is a mapping of sequence patterns to motif structures. With it you can look up, given an amino acid sequence, what the structure should be locally. The technique is local structure prediction--using short pieces of sequence to predict motifs," he says. The motif library method is the first step in a hierarchical approach to protein structure prediction [and], in a sense, a look-up table.
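The look-up idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the I-SITES library or its scoring scheme: the three-letter patterns, the labels "motif_A" and "motif_B," and the function name are hypothetical, and exact string matching stands in for whatever pattern scoring a real motif library uses.

```python
# Toy look-up table: short sequence pattern -> local structure label.
# Both the patterns and the labels are invented for illustration only;
# they are NOT entries from the I-SITES library.
TOY_MOTIF_LIBRARY = {
    "AAA": "motif_A",
    "GPG": "motif_B",
}


def predict_local_structure(sequence, library, window=3):
    """Slide a fixed-size window along the sequence and look up each piece.

    Returns a list of (start_index, fragment, label) for every fragment
    found in the library. Exact matching is an assumed simplification.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        fragment = sequence[start:start + window]
        label = library.get(fragment)
        if label is not None:
            hits.append((start, fragment, label))
    return hits


if __name__ == "__main__":
    for hit in predict_local_structure("MAAAGPGK", TOY_MOTIF_LIBRARY):
        print(hit)
```

Each lookup is a constant-time dictionary access, so the cost grows only linearly with sequence length. That is the appeal of the pattern recognition route over simulating the molecule from first principles.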

Bystroff adds that the NCBI-based search program, BLAST, has greatly sped up searches. "Before, it took a long time for a query/sequence to be compared. But now with BLAST, it's rapidly used, and has fancy indexing techniques that have become a commonly used tool in bioinformatics." This has enabled NCBI to become a center for bioinformatics advances. Bioinformatics specialists nationwide are focusing intense scrutiny on expression arrays, protein structure prediction, structures of RNA, and investigations of transcription factor binding sites.


Rosetta Stones and Phylogenetic Profiles

At Eisenberg's UCLA-DOE laboratory, researchers are developing two methods for supplying functional information for many proteins at once. One method is the phylogenetic profile, which Eisenberg says represents an advance over traditional homology. Phylogenetic profiles enable researchers to place a protein in its context of cellular function, he argues. "The idea is that we looked to see all of the fully sequenced genomes in which a protein is found; and then we look for a second protein that is present in exactly the same set of genomes. That is, the pattern of presence and absence of two proteins is the same. Why the same? Our hypothesis is that the two proteins are functioning together; and the reason they're always present or absent is that they work together, so we're able to infer many linkages."
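As a rough illustration of the profile comparison Eisenberg describes, the Python sketch below records each protein's presence or absence across a set of genomes and groups proteins whose patterns match exactly. The genome names, protein names, and function names are hypothetical toy data. The sketch assumes exact profile identity as the only linking criterion, with none of the statistical assessment a real analysis would need.

```python
from collections import defaultdict


def build_profiles(presence, genomes):
    """Turn {protein: set of genomes containing it} into presence/absence tuples."""
    return {
        protein: tuple(genome in found_in for genome in genomes)
        for protein, found_in in presence.items()
    }


def link_by_profile(profiles):
    """Group proteins that share an identical presence/absence pattern.

    Returns a sorted list of groups; only groups with two or more
    proteins are kept, since a lone protein implies no linkage.
    """
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


if __name__ == "__main__":
    genomes = ["genome_1", "genome_2", "genome_3", "genome_4"]  # toy names
    presence = {
        "P1": {"genome_1", "genome_3"},
        "P2": {"genome_1", "genome_3"},
        "P3": {"genome_2"},
        "P4": {"genome_1", "genome_2", "genome_3", "genome_4"},
    }
    print(link_by_profile(build_profiles(presence, genomes)))
```

In this toy data set only P1 and P2 share a pattern, so they form the single inferred linkage. A profile that is present in every genome would carry little information, which is one reason a practical version would need to weigh how surprising each match is.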

A second method the lab has pioneered is known as the "Rosetta Stone" method, in which two proteins that are separate in one type of cell are found in another, say in yeast, "but are fused into a single protein chain. We infer that those two proteins function together because they are right in the same molecule. Therefore we infer, for example, that in [Escherichia] coli they interact; and that two messages are giving us the same information about the system." Eisenberg observes that "by using the two methods and others, we've been able to build up these networks of interactive proteins."
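The fusion inference can also be sketched in Python. The sketch assumes that a sequence search has already produced a list of hits, each saying that a protein from one organism matches a region of a longer protein in another. The hit format, the protein names, and the non-overlap rule are illustrative assumptions, not a description of the UCLA lab's actual pipeline.

```python
from itertools import combinations


def rosetta_stone_links(hits):
    """Infer functional links from fused proteins.

    hits: iterable of (query_protein, fused_protein, start, end), meaning
    the query matches residues start..end of the candidate fused protein.
    Two different queries are linked when they match non-overlapping
    regions of the same fused protein (an assumed, simplified criterion).
    Returns a sorted list of ((query_a, query_b), fused_protein).
    """
    regions_by_fused = {}
    for query, fused, start, end in hits:
        regions_by_fused.setdefault(fused, []).append((query, start, end))

    links = set()
    for fused, regions in regions_by_fused.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                links.add((tuple(sorted((q1, q2))), fused))
    return sorted(links)


if __name__ == "__main__":
    toy_hits = [  # hypothetical search results
        ("A1", "F1", 1, 100),
        ("A2", "F1", 120, 250),
        ("A3", "F1", 90, 200),
        ("A4", "F2", 1, 50),
    ]
    print(rosetta_stone_links(toy_hits))
```

Here A1 and A2 occupy separate stretches of the fused protein F1 and are linked, while A3 overlaps both and is left out. The non-overlap test is what distinguishes two partners fused in one chain from two similar proteins that merely match the same region.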

However, the work is just starting, he adds: "I see just the beginning of mapping out the networks of interactions of proteins within the cell. This problem hasn't been solved; it's just been formulated. The database of phylogenetic profiles predicts protein-protein interactions and catalogs experimental interactions, but both types of information have to be built up, predicted, and compared. Moreover, the statistical methods for assessing and prioritizing predictions have to be refined until there's an agreement between experimental and predictive methods," Eisenberg says.

Barry Honig, a Howard Hughes Medical Institute investigator and professor of biochemistry and molecular biophysics at Columbia University, is also working on 3D protein structures as determined by crystallography. His lab center is part of the NIGMS Protein Structure Initiative, and he is studying multiple structure alignments, signals, and sequences that are characteristics of particular structural families of proteins. "Recently we've discovered that the electrostatic properties of proteins determine where the protein goes in the cell," he states. Now that it's possible to look at protein families rather than at individual proteins alone, "we need to find out what these families have in common, how they're different--very often the differences are reflected in the fold of the protein and properties of the protein surface." Honig's lab has devised a program, GRASP, which is used widely in the structural biology community, and whose graphics "map physical properties on the protein surface, where we can see [electrostatic] potential, sequence conservations, and hydrophobicity."

Honig continues, "The whole idea of high-throughput determination is we can't determine the structure of every single protein, so we'll have to build models for structures for which we don't have experimental determinations. A clear goal is to make those models as accurate as possible." That is the evolving role of bioinformatics technology, he adds. "We still have a great deal of trouble detecting relationships between proteins that don't have obvious sequence relationships, but the technology has gotten better," Honig maintains. "We can only build accurate homology models when the sequence identity between the unknown and known is relatively high. We need scientific tools to extract information from data that becomes available--not only to understand sequences better, but to use the information to understand how proteins function together."

That will be a major agenda for bioinformatics in years to come. Honig concludes, "We can't understand function from structure alone. We need computational tools to describe physical and chemical properties of proteins to determine how they function."


Arielle Emmett is a contributing editor for The Scientist.


1. A. Emmett, "The human genome," The Scientist, 14[15]:1, July 24, 2000.

2. S. Liu et al., "Markovian structures in biological sequence alignments," Journal of the American Statistical Association, 94:445, 1999.

3. P. Smaglik, "'Shotgun wedding': public, private Drosophila sequencing agreement should speed project, ensure accuracy," The Scientist, 13[5]:1, March 1, 1999.