Putting the "Bio" Back in Bioinformatics

James Brown (james.r.brown@gsk.com)
Dec 19, 2004

The renowned Canadian literary scholar, Northrop Frye, once wrote: "The more trustworthy the evidence, the more misleading it is."1 Although he was not referring to the biotechnology revolution, to my mind it is an apt caveat for the contemporary challenges facing the field of bioinformatics in drug discovery. We are becoming awash in new "evidence" (substitute "data") revealing in great depth the intricacies of cellular life, but are we properly using these data to find new medicines?

The evidence in the life sciences is becoming highly trustworthy as well as expansive. Genomes from a variety of species are being sequenced at a breathtaking rate. DNA polymorphisms are being widely determined for human populations according to geography, disease disposition, and other traits. We now have exquisite tools including DNA microarrays, RNA interference technology, and soon, protein expression arrays, for probing the internal...


But how useful are these data when it comes to inferring (or discovering) natural cellular processes and translating such findings into medical therapies? For example, knocking out genes with elevated RNA expression in cancer cells often has no effect on tumor viability. Many examples are known of genes being statistically associated with disease phenotypes, yet it is difficult to make the subsequent leap to develop their encoded proteins into drug targets.

The purpose of any genomics study is to understand the relationship between genotype and phenotype. In drug discovery, we are interested in two fundamental phenotypes: those of the healthy and of the diseased individual. However, genotypic data (broadly meant here as the genome and all its associated processes and nuances) are insufficient in themselves to make the connection to phenotype. Genotypes are linked to phenotypes by pathways. The grail of pharmaceutical research is to develop a therapy, either a small-molecule inhibitor or a biological agent, that will nudge the pathway at a point that shifts the phenotype away from the diseased state and toward the healthy state, with minimal side effects or adverse events.

Unfortunately, understanding pathways is still highly refractory to computational automation. Certainly, this is not for the lack of effort. Public institutions and private companies have developed a growing number of databases and tools to investigate pathways. Most attempt to distill and quantify publications that are rich in the relevant molecular characterizations, such as protein-protein interactions, transcription-factor binding sites, and so on, through manual annotation, computational text-mining, or some combination of both.

Computational pathway analyses have two major limitations. The first is an issue at the very root of scientific investigation: Negative results are seldom published. When annotating possible pathway components, information on nonparticipating or neutral proteins can be crucial. Neutral proteins would indicate which pathways might be unaffected by the introduction of a small-molecule inhibitor; this is particularly important for evaluating drug efficacy and safety. However, pathway software often cannot conclusively demonstrate noninteractions between pathways.

Secondly, in our experience, any pathway-mapping software needs to be carefully fed structured queries in order to generate meaningful and interpretable results. If the proteins in question are separated by a large number of pathway steps, the results can have too many interactions to be interpretable. On the other hand, input components that are tightly linked will give only a narrow view of a subsection of the pathway.
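The trade-off described above can be sketched on a toy network. The following is a minimal illustration, not any particular pathway tool: the network, the protein names, and the helper functions are all hypothetical, and "interpretability" is crudely approximated by counting the proteins that lie on at least one shortest path between two query proteins.

```python
from collections import deque

# Hypothetical toy interaction network (undirected adjacency list).
# Protein names A..G are placeholders, not real gene products.
NETWORK = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "E", "F"],
    "D": ["B", "G"],
    "E": ["B", "C", "G"],
    "F": ["C", "G"],
    "G": ["D", "E", "F"],
}


def distances_from(network, source):
    """Breadth-first search: steps from source to every reachable protein."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def proteins_on_shortest_paths(network, start, end):
    """Return the proteins lying on at least one shortest start-end path."""
    from_start = distances_from(network, start)
    from_end = distances_from(network, end)
    if end not in from_start:
        return set()  # the two query proteins are not connected
    total = from_start[end]
    return {
        p
        for p in network
        if p in from_start and p in from_end
        and from_start[p] + from_end[p] == total
    }


# Tightly linked query: only the two inputs come back (a narrow view).
narrow = proteins_on_shortest_paths(NETWORK, "A", "B")
# Distant query: every protein in the toy network is implicated.
broad = proteins_on_shortest_paths(NETWORK, "A", "G")
print(sorted(narrow), sorted(broad))
```

Even in this seven-protein example, the adjacent pair returns only itself while the pair three steps apart pulls in the whole network. Real interaction databases are far larger, which is why the choice of query proteins matters so much.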

In our hands, the most effective pathway reconstructions involve a knowledgeable computational biologist who uses a personal mental distillation of the literature and available data to guide the pathway-analysis software. Thus computational pathway reconstruction is not simply an infrastructure solution, but also requires a skilled group of analysts with backgrounds in bioinformatics, the relevant disease and/or molecular biology areas, and, increasingly, comparative genomics.


Since the determination of the 5,386-nucleotide genome of the bacteriophage φX174 in 1977, comparative genomics has progressively revolutionized genetic research in direct correlation with the complexity of the study organisms, beginning with viruses, followed by bacteria (and Archaea), fungi, plants, invertebrates, and now vertebrates, including mammals. In the pharmaceutical industry as well as the general medical research community, comparative genomics has been mainly focused on the "holy trinity" of medical research: the mouse, rat, and human. An exception is infectious diseases, where comparative genomics has played a vital role in understanding viral, bacterial, and parasitic pathogens. However, with more than 25 additional mammalian, avian, and cold-blooded vertebrate genomes now being targeted for sequencing, tremendous opportunities will arise for understanding the human genome by rigorous evolutionary analysis.3


[Figure caption: ... of the indicated mammals are being generated for regions orthologous to targets of the ENCODE Project. Comparative sequences will also be generated for other vertebrate species, including the frog, zebrafish, and chicken (not shown).]

Rather than just focusing on gene lists, researchers in the field of vertebrate comparative genomics recognize the opportunity to use evolutionary conservation to identify noncoding regions that might give clues to gene regulatory pathways. The ENCODE (ENCyclopedia of DNA Elements) Project Consortium is focused on identifying all functional elements in the human genome.4 In its pilot phase, the Consortium is concentrating on several regions that are 0.5 to 2 Mb in size and together make up about 30 Mb, or 1%, of the human genome. Orthologous regions in more than 20 other mammals, either currently available or to be newly sequenced, will be key to the annotation of human DNA elements. An added bonus is that several species that have roles in drug development, including dogs and primates, will be included in the comparison.

We have found that rigorous evolutionary analyses, based on a broad representation of species, have important roles in a number of different therapeutic areas. For example, evolutionary analysis of Aurora kinases provided a new context for the transference of knowledge from model systems; it also indicated a potential opportunity for targeting the ATP-binding pockets of multiple kinases with a single inhibitor.5 Annotation of pathways used by the malarial parasite, Plasmodium falciparum, benefits from the recognition of the unique evolutionary history of its genome, which involved the acquisition of bacterial, fungal, and plant gene homologues via multiple endosymbiosis events.

Forthcoming genomic sequences from the chimpanzee will be important not only for identification of human regulatory regions, but also for improving the resolution of SNP analyses. Molecular evolutionists have long recognized that selection on single-base-pair mutations can be neutral, negative, or positive. Neutral mutations often fluctuate between two nucleotide types and can be associated with population processes other than selection, such as genetic drift and bio-geographic variation. The best means to identify those selective forces is by comparisons using a closely related species known as an outgroup.

Presently, the nearest outgroup species to humans with complete genome sequences are the rat and mouse. However, the overall higher mutation rates in these rodents, as well as the length of time since divergence, make it difficult to align their DNA sequences with human sequences. The chimp genome is 99% similar to the human genome, so it, as well as the genomes of other primates being staged for sequencing, will provide an excellent outgroup filter to better distinguish selected from neutral nucleotide mutations in humans. This has the potential to greatly improve the signal-to-noise ratio of SNP and haplotype analyses.
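One simple way an outgroup is commonly used, offered here only as a hedged sketch and not as the method the author has in mind, is to "polarize" a human SNP: under a parsimony assumption, the allele shared with the outgroup is taken to be ancestral and the other allele derived. The function name and inputs below are hypothetical, and the sketch ignores complications such as recurrent mutation, alignment error, and polymorphism within the outgroup.

```python
def polarize_snp(human_alleles, outgroup_allele):
    """Infer (ancestral, derived) alleles for a biallelic human SNP.

    Assumption (simple parsimony): the human allele that matches the
    outgroup base is ancestral. Returns None when the site cannot be
    resolved this way, e.g. the outgroup carries a third base or the
    site is not biallelic.
    """
    alleles = set(human_alleles)
    if len(alleles) != 2 or outgroup_allele not in alleles:
        return None
    derived = (alleles - {outgroup_allele}).pop()
    return outgroup_allele, derived


# Hypothetical sites: human alleles at a SNP and the aligned chimp base.
print(polarize_snp({"A", "G"}, "G"))  # chimp matches G, so G is ancestral
print(polarize_snp({"A", "G"}, "T"))  # chimp has a third base: unresolved
```

The closer the outgroup, the fewer sites fall into the unresolved case, which is one reason a chimpanzee sequence is expected to be a better filter than a rodent one.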

As with pathways, overautomation of evolutionary analyses must be viewed cautiously. When homology-finding software such as BLAST is used to rank conserved proteins, the resulting order does not necessarily reflect evolutionary relationships. Further analyses of the collected sequences by experts using multiple-sequence alignment and phylogenetic tools are critical to understanding the evolutionary relationships of human proteins.

Many opportunities exist for constructive interactions between bioinformatics software developers and computational biology analysts. The role of the latter has sometimes been underplayed in bioinformatics. In industry, however, they are the front-line interface with the bench scientists and often have a critical role in experimental and project design. The complexity of the tools and the biological questions being addressed on the one hand, and the advances in information technology platforms on the other, inevitably mean that no single individual can effectively span both realms. As long as communications remain open, multidisciplinary teams are the most effective way to organize bioinformatics support for drug discovery. Northrop Frye1 was not advocating that one ignore evidence, but rather that naive processing of the facts should always be subservient to the understanding gained by reason, insight, and rational argument. These humanistic processes are impossible to compute automatically.
