Textual model questions efficiency of gaining scientific knowledge
By Ishani Ganguli | March 14, 2006
Published scientific statements, whether later proven true or false, have a profound effect on how subsequent researchers interpret their own results and on the probability that a field eventually reaches the correct conclusion about a scientific question, according to a statistical analysis of the protein interaction literature. The findings, published this week in the Proceedings of the National Academy of Sciences (PNAS), suggest that the way these "microparadigms" bias future interpretations may actually slow the acquisition of scientific truth.
The paper "sets up a very sophisticated model" to answer large-scale questions about how scientific knowledge is produced that "no one has previously been able to measure," said Neil Smalheiser at the University of Illinois at Chicago, who did not participate in this study.
These findings hint that the "current way we produce and interpret results is not optimal" for scientists to ultimately converge upon the correct result, according to first author Andrey Rzhetsky of Columbia University. "The model suggests that dependence between statements is too strong."
Rzhetsky's team assessed 1.5 million unique statements about protein interactions drawn from 150,000 full-text articles in 78 journals (GENEWAYS 6.0). Using a binary classification (e.g., protein A either interacts or does not interact with protein B), they ordered the statements about each pair of proteins chronologically to construct chains of reasoning over time.
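The chain-building step can be pictured with a small sketch. This is an illustration, not the authors' code: statements are represented here as hypothetical (protein pair, date, claim) tuples, grouped by pair and sorted by date to yield one chronological chain per pair.

```python
from collections import defaultdict

# Hypothetical toy data: each statement records which protein pair it concerns,
# when it was published, and whether it claims an interaction (True) or not.
statements = [
    (("A", "B"), "2001-06-01", True),
    (("A", "B"), "1999-02-10", True),
    (("A", "C"), "2000-01-05", False),
    (("A", "B"), "2003-11-20", False),
]

# Group statements by protein pair, then sort each group chronologically
# to form that pair's "chain of reasoning".
chains = defaultdict(list)
for pair, date, claims_interaction in statements:
    chains[pair].append((date, claims_interaction))
for pair in chains:
    chains[pair].sort()  # ISO dates sort correctly as strings
```

After this, `chains[("A", "B")]` holds the three statements about that pair in publication order, ready to be analyzed as a sequence.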
The group then simulated different ways scientists might approach published findings, and assessed the probability that each scenario would lead to the correct answer at any given step of the chain. If scientists trust nobody, for example, and ignore all previous literature, the probability of publishing the correct result remains constant. At other extremes, scientists could be super-conformists (usually agreeing with the majority opinion about a given protein-protein relationship) or super-anti-conformists (usually agreeing with the minority opinion).
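A minimal simulation conveys the flavor of these scenarios. This is my own toy sketch, not the authors' model: each paper in a chain asserts a claim about one protein pair whose ground truth is "interacts", each scientist's experiment is correct with an assumed probability, and the strategies are encoded as illustrative weights on one's own data versus the prior literature (the 10:1 "mild skeptic" weighting echoes the self-weighting the article reports later).

```python
import random

# Illustrative (own_weight, prior_weight) per strategy -- my assumptions,
# not parameters from the paper.
STRATEGIES = {
    "ignore_all":      (1, 0),    # trust nobody: publish only your own result
    "conformist":      (1, 10),   # the prior majority swamps your own data
    "anti_conformist": (1, -10),  # contradict the prior majority
    "mild_skeptic":    (10, 1),   # own data outweighs any single prior claim
}

def simulate_chain(strategy, steps=20, accuracy=0.7, trials=4000, seed=0):
    """Return the fraction of correct published claims at each chain position."""
    own_w, prior_w = STRATEGIES[strategy]
    rng = random.Random(seed)
    correct = [0] * steps
    for _ in range(trials):
        published = []  # True means the paper asserted the correct claim
        for i in range(steps):
            own = rng.random() < accuracy  # own experiment, right w.p. accuracy
            score = own_w * (1 if own else -1)
            score += prior_w * sum(1 if s else -1 for s in published)
            claim = score > 0 if score != 0 else own  # tie: fall back on own data
            published.append(claim)
            if claim:
                correct[i] += 1
    return [c / trials for c in correct]
```

Under "ignore_all" the fraction of correct papers stays flat at roughly the experimental accuracy, matching the article's point that ignoring the literature keeps the probability constant, while the conformist strategy tends to lock each chain onto whatever its earliest papers claimed.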
The authors searched their real-world data set for these hypothetical patterns; while all five were present, mild skepticism was the most common.
When they measured the momentum, or strength of influence, of published statements on future interpretations, they found that scientists give their own data at least 10-fold greater weight than others' findings, but are still heavily influenced by previous results and particularly by the majority opinion -- revealing a tendency toward conformism. What's more, the authors discovered that a strikingly large proportion of results (95%) are positive -- reporting the presence rather than the absence of an interaction.
According to the authors' stochastic analysis, this predominance of positive results can only be explained by one of two extremes: a very low rate of experimental error or exceptionally unreliable experiments. So scientists are either perpetuating truth or perpetuating errors, Rzhetsky said.
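A back-of-the-envelope calculation (my illustration, assuming a simple symmetric error model, not the authors' analysis) shows why 95% positive reports forces one of two extremes. If a fraction pi of tested protein pairs truly interact and each experiment is wrong with probability eps, the expected share of positive reports is q = pi*(1 - eps) + (1 - pi)*eps. Solving for eps at q = 0.95 reveals which combinations are even feasible:

```python
def error_rate_for(q, pi):
    """Error rate eps implied by positive-report share q and true-interaction
    rate pi, under q = pi*(1-eps) + (1-pi)*eps. Returns None if no error
    rate in [0, 1] can produce that q."""
    if pi == 0.5:
        return None  # q is 0.5 regardless of eps; no information
    eps = (q - pi) / (1 - 2 * pi)
    return eps if 0.0 <= eps <= 1.0 else None

q = 0.95
for pi in (0.01, 0.05, 0.25, 0.50, 0.75, 0.96, 0.99):
    print(f"pi={pi}: eps={error_rate_for(q, pi)}")
```

Feasible solutions exist only at the ends of the range: either nearly all tested pairs truly interact and errors are rare (e.g., pi = 0.99 implies eps of about 0.04), or almost none interact and the experiments are wrong almost all the time; intermediate true-interaction rates like pi = 0.25 admit no valid error rate at all.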
The authors also found that the momentum of actual published statements is too high to optimize the probability of arriving at the right result at the end of a given chain. This phenomenon could be explained by the premium placed on new data in science publishing, said Gully Burns at the University of Southern California, who did not participate in this study. "You can't really get things published simply reproducing other people's results," he said. To produce correct scientific knowledge more efficiently, Rzhetsky suggested "independent benchmarking" by an institution that would periodically verify a sampling of the literature.
This paper demonstrates the utility of similar exercises for large-scale data mining, even beyond protein interactions, Burns told The Scientist. Researchers have performed similar data mining only in sequence databases, added Smalheiser. "It's probably as sophisticated an example of text mining as there is so far, [and] more direct and more sensitive than citation analysis."
Still, according to Burns, it will be important to "parametrize more details of individual experiments" in a future model, for example by accounting for the section of the paper in which a statement is found or the animal model or cell type used to derive it.
Rzhetsky said this work is part of a larger effort to sort and evaluate millions of facts from the literature to create an overarching model of cellular interactions. "A huge amount of information is already published and locked in literature," he said. "We're trying to get that information out."
Links within this article
A. Rzhetsky et al., "Microparadigms: Chains of collective reasoning in publications about molecular interactions," PNAS, March 14, 2006.
R. Finn, "Program uncovers hidden connections in the literature," The Scientist, May 11, 1998.