The Hum and the Genome
Minds Must Unite
Nina Fedoroff (left) is Willaman Professor of Life Sciences and an Evan Pugh Professor in the department of biology and the Huck Institutes of the Life Sciences at Pennsylvania State University. She is also a member of the National Academy of Sciences. Steve Racunas (center) is a postdoctoral fellow at the Huck Institutes. Jeff Shrager (right) is a research fellow at the Carnegie Institution of Washington's Department of Plant Biology at Stanford University. The authors are collaborating on a computational toolkit for expressing and checking the consistency of biological models with existing knowledge and data.
Experimental biologists today sit at the edge of enormous bodies of information. Although much of it still resides in journals, primary information increasingly inhabits the digital realm. Knowledge bases, which represent facts and experimental observations using abstractions such as metabolic pathways and regulatory networks, have grown in sophistication. And organismal databases abound, varying in complexity from simple repositories to constellations of highly integrated knowledge.
But the volume and complexity of available information far exceeds the synthetic and analytic capacities of any individual. Unaided, we humans are poor at reasoning meaningfully about the complex dynamic systems that make up even the simplest organism. To augment our limited capabilities in a way that creates understanding, we need new "tools for thought": formal representational systems appropriate for biological modeling, and computational tools that can manipulate, check, and use these models to make predictions and form explanations.
BEYOND THE BASES
Biological databases and knowledge bases serve as excellent repositories, but they don't provide the ability to analyze and manipulate models. And pathway knowledge bases, in essence little more than sophisticated electronic textbooks, capture only the static state of what is known. Neither provides the means to use existing knowledge to build new knowledge. Even data analysis programs such as Blast cannot be called tools for thought. These constitute tools to manipulate data, just as a pipette or PCR instrument serves as a tool for manipulating DNA.
Engineers have used mathematical tools for thought since at least the 17th century. Electronic computers were designed in large part to support engineering analyses, and they continue to play a major role through sophisticated software packages such as Mathematica, Maple, and MATLAB. Biologists, who are essentially engaged in the "reverse engineering" of biological systems, should similarly be able to create models and use software tools for simulating, manipulating, and analyzing their models.
Tools for thought provide two functions: simulation and analysis. Simulation allows scientists to make predictions from precisely expressed models of the system (usually using a computer). Analysis provides guidance and feedback about the validity of models through comparison, manipulation, and validation against available data and knowledge. Combined, these functions enable scientists to ask "what if" questions about a system, form explanations, and make and test predictions. The main limiting factor, however, is formal representation: the "mathematics" of biological models.
THE RIGHT REPRESENTATION
Many efforts are being made to mathematically capture the essential features of biological systems, whether a MAP kinase phosphorylation cascade, the cell cycle, or a beating heart. A time will come, perhaps, when this will be possible for entire organisms. But for the moment, biologists still must discover the qualitative workings of biological machines.
Only recently have we discovered, for example, that organisms have RNA-based feedback systems that can finely adjust gene-expression levels. Because we have yet to achieve a complete qualitative description of biological systems, quantitative equations are likely to have limited utility in experimental design. Instead, experimental biologists need ways to work with the available qualitative information: the knowledge of genes, their promoters, and their enhancers; the knowledge of protein structure and interaction; where things reside in cells and how these locations change over time; and how information is communicated within and between cells, within an organism, and between organisms and their environment.
Moreover, experimental biologists need to know what other experimentalists have done. Rarely do they get such information from textbooks, whether printed or digital, because the process of abstraction that underlies the representations usually excludes supporting and contradictory data and knowledge. But it is precisely by exploring the underlying evidence, especially evidence that doesn't fit the prevailing paradigm, that scientists formulate new ideas.
Alternatives to differential equations make it possible to write down and reason about complex systems in the way that biologists think about them. One such method is a branch of artificial intelligence called qualitative processes (QP) theory.1 QP models are composed of qualitative assertions cast in a formal syntax. Take, for example, the statement: "Increasing light increases the efficiency of carbon fixation to a point, after which the rate of reaction decreases." QP models are qualitative, not quantitative, and their terms can be uncertain, but this does not mean that they are informal. Qualitative values have a precise mathematics, and it is possible to run simulations on such models.2 Their uncertainties can be managed much as uncertainty is managed in statistics.
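The light-and-carbon-fixation statement above can be given a QP-like flavor in a few lines of code. This is a minimal sketch, not an actual QP-theory implementation: the level names, the `light_influence` rule, and the simulation loop are all illustrative assumptions, chosen only to show how qualitative values and signed influences can be simulated without any rate equations.

```python
# A toy qualitative model: quantities take ordered qualitative values,
# and an influence maps the cause's level to the SIGN of its effect.
LEVELS = ["low", "medium", "high"]

def step(level, direction):
    """Move a qualitative value up (+1), down (-1), or hold (0)."""
    i = max(0, min(len(LEVELS) - 1, LEVELS.index(level) + direction))
    return LEVELS[i]

def light_influence(light):
    """'Increasing light increases carbon fixation to a point, after
    which the rate decreases': positive influence at low and medium
    light, negative at high light (illustrative threshold)."""
    return +1 if light in ("low", "medium") else -1

def simulate(light_trajectory, fixation="low"):
    """Qualitatively simulate the fixation level under a light schedule."""
    history = [fixation]
    for light in light_trajectory:
        fixation = step(fixation, light_influence(light))
        history.append(fixation)
    return history

print(simulate(["low", "medium", "high", "high"]))
# fixation rises while light is low or medium, then falls at high light
```

Note that the simulation predicts only the direction of change, never a numeric rate, which is exactly the kind of answer an experimentalist can test before any quantitative parameters are known.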
THE UNDERSTANDING CYCLE
Courtesy of Nigam Shah
When experimental biologists advance their understanding of a biological system, the usual starting point is the hypothesis, or model, constructed and refined using available information about the biological system. Refined hypotheses are subjected to experimental testing, and hypotheses that survive this validation are shared, generally through publication. HyBrow assists in the tasks bound by dotted lines.
Principles of qualitative modeling, together with a collection of qualitative biological-model components and a qualitative simulator, provide the simulation part of a toolbox for thought. Analysis is harder. It requires integrating and comparing models in the light of what is known, and then enabling manipulation to form new models that might lead to discovery of a previously unknown subsystem, a hidden relationship, or a process or player that might revise and deepen our current understanding.
The simplest thinking tool at the experimentalist's disposal is the hypothesis, a statement that postulates certain relationships within a system. Each hypothesis is a simple model, and a "system-level" model can be thought of as a collection of interrelated hypotheses. Expressed using formal QP-like representations, hypotheses can provide the foundation for computational biological modeling. We can then create tools for thought that perform the requisite simulation and analysis operations upon hypotheses.
Using a rules library that encodes biological requirements and prerequisites for biological events, it is possible to generate a "required background and history" for each hypothesis, together with the set of relationships that must hold true in the wake of each hypothesized event. This is the qualitative analog of simulation. The components of each hypothesis can be checked against literature and data residing in electronic databases, providing the experimentalist with links to the information that supports the hypothesis as well as information that contradicts it. If the interaction with this hypothesis-checking tool were sufficiently intuitive – using diagrams to represent sets of hypotheses – one could use this interface to craft increasingly complex system-level models comprising hypotheses that survive both validity checks and subsequent experimental testing. From there it is a small step to imagine that the model-checking tool could itself assist the experimentalist to identify, by structural homology between models of other biological systems, changes that might improve the model's fitness.
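The rules-library idea above can be sketched concretely. In this hedged example, an assertion is a (subject, relation, object) triple, a toy rules library maps an event type to prerequisites that must hold for it, and a small fact store stands in for the electronic databases. All names (`RULES`, `KNOWLEDGE`, the gene symbols, the localization rule) are hypothetical illustrations, not real curated biology or any actual tool's API.

```python
# A transcription-activation event requires, in this toy rule set,
# that the activator be localized to the nucleus.
RULES = {
    "activates_transcription_of": [lambda s, o: (s, "localized_in", "nucleus")],
}

# Stand-in for literature and database facts (illustrative, not real data).
KNOWLEDGE = {
    ("GAL4", "localized_in", "nucleus"),
    ("HXK2", "localized_in", "cytoplasm"),
}

def prerequisites(assertion):
    """The 'required background' generated for a hypothesized event."""
    s, rel, o = assertion
    return [rule(s, o) for rule in RULES.get(rel, [])]

def check(assertion):
    """Return (supporting, contradicting) facts for each prerequisite."""
    supported, contradicted = [], []
    for pre in prerequisites(assertion):
        if pre in KNOWLEDGE:
            supported.append(pre)
        else:
            # A fact about the same subject and relation but a different
            # object counts as a contradiction of the prerequisite.
            conflict = [f for f in KNOWLEDGE if f[:2] == pre[:2] and f != pre]
            contradicted.extend(conflict or [pre])
    return supported, contradicted

print(check(("GAL4", "activates_transcription_of", "GAL1")))
print(check(("HXK2", "activates_transcription_of", "GAL1")))
```

The first hypothesis finds a supporting localization fact; the second surfaces the contradicting one, which is precisely the link back to the underlying evidence that the text argues an experimentalist needs.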
HYBROW AND BIOLINGUA
We have begun a collaborative project to develop such a toolkit based on the Hypothesis Space Browser (HyBrow)3 and BioLingua,4 a knowledge-based biocomputing platform. HyBrow considers hypotheses entered in either structured text or pictorial format, decomposes them into their atomic assertions, and tests these against stored knowledge and data, and for logical consistency. In addition to checking validity, HyBrow provides links to the data or knowledge that both support and contradict assertions (both explicit and implicit) in the model.
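The decomposition step, breaking a structured-text hypothesis into atomic assertions, can be illustrated with a deliberately simple parser. The input syntax and function name here are invented for illustration and are not HyBrow's actual input format.

```python
def decompose(hypothesis):
    """Split a structured hypothesis of the form
    'SUBJECT relation object and relation object ...'
    into atomic (subject, relation, object) assertions."""
    subject, rest = hypothesis.split(" ", 1)
    triples = []
    for clause in rest.split(" and "):
        relation, obj = clause.split(" ", 1)
        triples.append((subject, relation, obj))
    return triples

print(decompose("Gal4p binds GAL1-promoter and activates GAL1"))
```

Each resulting triple can then be tested independently for consistency against stored knowledge, and the hypothesis as a whole judged by how its atomic parts fare.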
The HyBrow vision seeks to not only enable the individual experimentalist to represent and check hypotheses and models, but also to collect the hypotheses that survive experimental testing in a new kind of knowledge base, indexed by the constituent hypotheses themselves and containing links to all the supporting and contradictory data and knowledge. This would enable a community of biologists to develop complex biological models that are firmly based in knowledge and experimentation, while retaining the flexibility to respond to changing concepts and new information.
The BioLingua platform for this vision is a persistent qualitative knowledge base and knowledge-based programming environment. BioLingua enables users to write simple programs to compare, contrast, and update models, as well as to piece together models, extract knowledge from literature, and guide searches in the space of possible models – the panoply of knowledge manipulation. In HyBrow-BioLingua, users will compose hypotheses and models in an intuitive format, and the hypotheses will be stored and analyzed against the integrated knowledge base. The result will provide feedback to the individual about the consistency of the hypothesis with what is already known, and it will suggest modifications to improve the viability of the hypothesis. But the whole is larger than the sum of its parts, comprising a dynamic knowledge base of biological hypotheses and models developed by the community and expressed in a common language.
Tools for thought use formal representation and computational methods to enable scientists to turn complex models over in their hands and heads. Perhaps through projects such as these the biological community will become as comfortable with computational tools for thought as they are with pipettes and PCR.