Opinion: Latent Value in the Literature

In the United States, like so many countries around the world, scientists are learning to get by on shoestring budgets. Reduced federal funding has the biomedical community looking for ways to squeeze more out of every grant. In an age of austerity, success is about how smart we are with each dollar invested in the country’s research enterprise.

One way to eke more value from those research and development dollars is to extract more knowledge from the scientific data and findings we have already generated. Decades of investment have produced a wealth of information from basic research projects, translational studies, and clinical trials. But we as a community often do not know what is available, and we lack a coherent resource that brings all of this information together—a massive repository that can be queried, or a powerful tool that can gather intelligence from a number of...

Recent efforts within the open access community are making inroads toward improving access to data and publications, but researchers remain largely unaware of what relevant data exist, and picking the highest-quality data and repurposing them for new uses is still a challenge. Primary data sets are buried in an increasingly large number of data silos that are impractical for the average scientist to track, and data are typically formatted in a fashion convenient for the producer, not the consumer. This makes finding, extracting, and reusing key discoveries and insights difficult. As a result, it is often easier for researchers to obtain funds to regenerate data than to reuse them.

Tools designed to gather, query, and mine existing data could yield useful insights. Consider Google Maps, which through powerful algorithms creates personalized driving directions, predicts commute times, and suggests travel alternatives. These algorithms rely on Google’s high-quality, well-structured, up-to-date central database of geographic information. The Google Street View cars keep the map up to date so that we, the users, don’t have to worry about it.

Scientists need a similar solution: a global, Internet-accessible Google Maps of biomedical knowledge, such that new bioinformatics techniques and tools can empower researchers and clinicians around the globe looking to better understand, diagnose, and treat human disease.

If this sounds like one person’s rose-colored vision of the scientific field, it isn’t. My company, QIAGEN Silicon Valley, works in the field of knowledge-driven data mining in medicine and biology. A proof-of-principle project we did for the U.S. Department of Defense showed that latent knowledge in the scientific literature can be organized and reused to effectively predict fundamental new insights.

Our study used powerful algorithms to predict molecular drug targets against infection by high-risk viral and bacterial agents. We trained our database by feeding it all experimental evidence we could find in the research literature that described host-pathogen interactions and the associated molecular and pathway biology of these organisms. Our approach spanned decades of knowledge across virology, bacteriology, immunology, and basic biology—far more than any individual researcher could read and assimilate, let alone stay up-to-date on. Our specialized “knowledge construction” techniques included scouring every data source, integrating information into standardized semantic data models of disease biology, and feeding it into a single graph-based data model, which our high-performance computers processed.

We then set our algorithms loose on the database, asking them to identify potential drug targets. The algorithms returned a slew of them, prioritized according to level and quality of evidence. Those that appeared most compelling to our research team went through experimental testing, which showed that one-third of the predicted targets had a significant impact on survival in mice infected by these biothreats.
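To make the idea concrete, here is a deliberately tiny sketch of evidence-weighted target ranking. It is not our production system: the gene names, processes, quality scores, and the simple additive scoring rule are all hypothetical placeholders chosen for illustration. It only shows how findings from unrelated fields can add up to support for the same candidate target.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    gene: str       # host gene named in a published finding
    process: str    # biological process the gene is reported to affect
    field: str      # research field the finding came from
    quality: float  # hypothetical 0-1 evidence-quality score


# Made-up example findings; none of this is real data.
FINDINGS = [
    Finding("GENE_A", "cell entry", "virology", 0.9),
    Finding("GENE_A", "cell entry", "oncology", 0.6),
    Finding("GENE_B", "inflammation", "cardiology", 0.7),
    Finding("GENE_C", "lipid transport", "cardiology", 0.8),
]

# Processes assumed (for this sketch) to matter for pathogenesis.
PATHOGEN_PROCESSES = {"cell entry", "inflammation"}


def rank_targets(findings, processes):
    """Sum evidence quality per gene over findings tied to the given processes."""
    scores = defaultdict(float)
    for f in findings:
        if f.process in processes:
            scores[f.gene] += f.quality
    # Highest total evidence first; ties broken alphabetically by gene name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


RANKED = rank_targets(FINDINGS, PATHOGEN_PROCESSES)
for gene, score in RANKED:
    print(gene, round(score, 2))
```

In this toy data set, GENE_A ranks first because a virology finding and an oncology finding both tie it to the same process, while GENE_C drops out because its process is not on the list. A real system would need far richer relationships and scoring than a single sum.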

Many of the targets were novel and associated with unusual biological mechanisms, in some cases ones that were not broadly established in immunology. The approach we used linked experimental results from diverse fields of research that characterized genes or pathways critical to the pathogenesis of these viruses and bacteria. The database was able to repurpose experimental findings from, say, cancer or cardiovascular studies when they were relevant to immunology and infectious disease.

This project demonstrated that a large-scale in silico approach can successfully screen and identify new drug targets, even for complex problems like broad-spectrum therapeutic discovery. The knowledge of those targets was already out there, in some cases in publications that have been out for years. It was just a matter of bringing the right sources together and developing a tool that could make sense of the information.

When you combine this type of approach with the tremendous amount of new data being generated every day in science, the possibilities seem endless.

Human disease and physiology are marvelously complex, and we are still early in developing useful mathematical and computational models of living systems. We have a long road ahead of us until computers can simulate complex biology or predict cures for every human illness. Indeed, constructing and maintaining such an atlas of biomedical research knowledge would be a challenge. But we are making progress.

There is a tremendous amount of unmaterialized value in the ever-growing scientific literature and databases. Indeed, as a community, our ability to generate data is outpacing our ability to effectively convert these data into actionable insights that drive better decisions in clinical research and drug discovery. That means everyone stands to benefit not simply from demanding more research funding but also from capturing the intelligence that is already there and just waiting to emerge. Together, with some creativity and elbow grease, we can harness the collective knowledge of the scientific community to let clinicians and researchers unlock the mysteries of human disease and ultimately deliver better diagnostics and therapeutics to patients around the world.

Ramon Felciano co-founded Ingenuity Systems (now QIAGEN Silicon Valley), where he is vice president of research.
