Opinion: Latent Value in the Literature

With scientific budgets eroding, the biomedical community needs to get more return from the data it has already generated.

By Ramon Felciano | April 28, 2014

In the United States, as in so many countries around the world, scientists are learning to get by on shoestring budgets. Reduced federal funding has the biomedical community looking for ways to squeeze more out of every grant. In an age of austerity, success is about how smart we are with each dollar invested in the country’s research enterprise.

One way to eke more value from those research and development dollars is to extract more knowledge from the scientific data and findings we have already generated. Decades of investment have produced a wealth of information from basic research projects, translational studies, and clinical trials. But we as a community often do not know what is available, and we lack a coherent resource that brings all of this information together—a massive repository that can be queried, or a powerful tool that can gather intelligence from a number of studies at once.

Recent efforts within the open access community are making inroads toward improving access to data and publications, but researchers often remain unaware that relevant data exist, and picking the highest-quality data and repurposing them for new use is still a challenge. Primary data sets are buried in an increasingly large number of data silos that are impractical for the average scientist to track, and data are typically formatted in a fashion convenient for the producer, not the consumer. This makes finding, extracting, and reusing key discoveries and insights difficult. As a result, it is often easier for researchers to obtain funds to regenerate data than to reuse them.

Tools designed to gather, query, and mine existing data could yield useful insights. Consider Google Maps, which through powerful algorithms creates personalized driving directions, predicts commute times, and suggests travel alternatives. These algorithms rely on Google’s high-quality, well-structured, up-to-date central database of geographic information. The Google Street View cars keep the map up to date so that we, the users, don’t have to worry about it.

Scientists need a similar solution: a global, Internet-accessible Google Maps of biomedical knowledge, such that new bioinformatics techniques and tools can empower researchers and clinicians around the globe looking to better understand, diagnose, and treat human disease.

If this sounds like one person’s rose-colored vision of the scientific field, it isn’t. My company, QIAGEN Silicon Valley, works in the field of knowledge-driven data mining in medicine and biology. A proof-of-principle project we did for the U.S. Department of Defense showed that latent knowledge in the scientific literature can be organized and reused to effectively predict fundamental new insights.

Our study used powerful algorithms to predict molecular drug targets against infection by high-risk viral and bacterial agents. We trained our database by feeding it all experimental evidence we could find in the research literature that described host-pathogen interactions and the associated molecular and pathway biology of these organisms. Our approach spanned decades of knowledge across virology, bacteriology, immunology, and basic biology—far more than any individual researcher could read and assimilate, let alone stay up-to-date on. Our specialized “knowledge construction” techniques included scouring every data source, integrating information into standardized semantic data models of disease biology, and feeding it into a single graph-based data model, which our high-performance computers processed.

We then set our algorithms loose on the database, asking them to identify potential drug targets. The algorithms returned a slew of them, prioritized according to level and quality of evidence. Those that appeared most compelling to our research team went through experimental testing, which showed that one-third of the predicted targets had a significant impact on survival in mice infected by these biothreats.

Many of the targets were novel and associated with unusual biological mechanisms, in some cases ones that were not broadly established in immunology. The approach we used linked experimental results from diverse fields of research that characterized genes or pathways critical to the pathogenesis of these viruses and bacteria. The database was able to repurpose experimental findings from, say, cancer or cardiovascular studies when they were relevant to immunology and infectious disease.
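As a rough illustration of how evidence-weighted ranking over a graph of literature findings might look, here is a minimal Python sketch. It is not the system described in this article: the `Finding` record, the `rank_targets` function, the quality weights, the halving of indirect evidence, and every gene, pathway, pathogen, and paper name are hypothetical placeholders chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights for evidence quality; a real system would calibrate these.
QUALITY_WEIGHT = {"in_vivo": 3.0, "in_vitro": 2.0, "computational": 1.0}


@dataclass(frozen=True)
class Finding:
    """One curated statement from a paper: `source` affects `target`."""
    source: str    # e.g. a host gene or pathway
    target: str    # e.g. a pathway or pathogen
    quality: str   # key into QUALITY_WEIGHT
    paper_id: str  # placeholder identifier for the citing paper


def rank_targets(findings, pathogen):
    """Score each node by its direct and two-hop evidence linking it to `pathogen`."""
    # Build a simple adjacency list: node -> findings that start at that node.
    graph = defaultdict(list)
    for f in findings:
        graph[f.source].append(f)

    scores = defaultdict(float)
    for node, edges in graph.items():
        for first in edges:
            w1 = QUALITY_WEIGHT[first.quality]
            if first.target == pathogen:
                scores[node] += w1  # direct evidence counts in full
                continue
            # Indirect evidence (node -> intermediate -> pathogen) is limited
            # by its weaker link and then halved (an arbitrary assumption).
            for second in graph.get(first.target, []):
                if second.target == pathogen:
                    w2 = QUALITY_WEIGHT[second.quality]
                    scores[node] += 0.5 * min(w1, w2)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up findings, for illustration only.
EXAMPLE_FINDINGS = [
    Finding("GENE_A", "PATHOGEN_X", "in_vivo", "paper-1"),
    Finding("GENE_A", "PATHOGEN_X", "in_vitro", "paper-2"),
    Finding("GENE_B", "PATHWAY_P", "in_vitro", "paper-3"),   # e.g. a cancer study
    Finding("PATHWAY_P", "PATHOGEN_X", "in_vivo", "paper-4"),
]
```

Calling `rank_targets(EXAMPLE_FINDINGS, "PATHOGEN_X")` on this toy data places `GENE_A` first, because two papers link it directly to the pathogen, while `GENE_B` still appears in the ranking through the pathway it shares with an unrelated study. That second case is the kind of cross-field reuse the paragraph above describes, reduced here to a two-hop lookup.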

This project demonstrated that a large-scale in silico approach can successfully screen and identify new drug targets, even for complex problems like broad-spectrum therapeutic discovery. The knowledge of those targets was already out there, in some cases in publications that have been out for years. It was just a matter of bringing the right sources together and developing a tool that could make sense of the information.

When you combine this type of approach with the tremendous amount of new data being generated every day in science, the possibilities seem endless.

Human disease and physiology are marvelously complex, and we are still early in developing useful mathematical and computational models of living systems. We have a long road ahead of us until computers can simulate complex biology or predict cures for every human illness. Indeed, constructing and maintaining such an atlas of biomedical research knowledge would be a challenge. But we are making progress.

There is a tremendous amount of unrealized value in the ever-growing scientific literature and databases. Indeed, as a community, our ability to generate data is outpacing our ability to effectively convert these data into actionable insights that drive better decisions in clinical research and drug discovery. That means everyone stands to benefit not only from demanding more research funding but also from capturing the intelligence that is already there and just waiting to emerge. Together, with some creativity and elbow grease, we can harness the collective knowledge of the scientific community to let clinicians and researchers unlock the mysteries of human disease and ultimately deliver better diagnostics and therapeutics to patients around the world.

Ramon Felciano co-founded Ingenuity Systems (now QIAGEN Silicon Valley), where he is vice president of research.

Selc

April 29, 2014

This is a great idea, but there will be big pushback from publishers, even though much of this information should be in the public domain once it is more than 10 years old (some would even say immediately). I think it is particularly important because scientists are often shy to say too much in the abstract, but one can find very useful information, negative results, and interesting hypotheses in the results and discussion sections of papers. In my own experience with the exercise literature, there were many conflicting results, and hints that too much exercise could be harmful, even as far back as the '60s, yet only now do you hear this in better-documented studies. There are countless pearls if you read the full papers in the biological and physical sciences. The German and Russian physics literature is very good also, and having students read the material from back in the '30s can clear up many questions and confusions they have in classes today, where often the professor doesn't know the answer or can't explain it clearly. It would be great if this could all be searched with a Google-type search. A tool that automatically generates data tables from graphs would also be very useful to researchers, as not all data are well represented, and sometimes in the physics and chemistry literature all one has is a few obscure papers where measurements have actually been taken. It would be great if publishers stopped worrying about copyright or copywrong and started offering such useful services for the historical literature.

Ramon Felciano

Replied to a comment from Selc made on April 29, 2014

May 23, 2014

Hi Selc --

Thank you for your comment. You raise a number of good challenges to the approach I describe, but our experience suggests that they can be overcome.

It is quite true that much of the most relevant information is not available in abstracts, which is why for 15 years our curation and knowledge modeling efforts have been based on full text articles, including tables, figures, etc. As you allude to, this can be expensive, as we have to buy every article we read. This is not unlike writing a review article, which is a new body of work that integrates and synthesizes our knowledge, with appropriate citations and links to the primary source materials. The review article is a useful metaphor for what we do; we've simply done it at very large scale, and produced a new, very large computational data model rather than a more traditional publication.

Our modeling formalisms are also fairly robust, in that they are able to correctly model negative results (or "evidence of absence") as well as the contradictory results you discuss. Our view is that such results are part of science and should be embraced: we are, by definition, working in a field where our understanding of the world is not yet well-developed or self-consistent. So we look to provide researchers with tools to quickly identify when a preponderance of independent lines of evidence supports a given hypothesis or model of disease biology, versus other areas that may have less clear support due to conflicting or inconsistent results. In many cases the latter may point to the need for additional validation studies that tease apart the differences in experimental outcomes reported in the literature.

Finally, your point about the value of non-English research literature is well-taken. One of the benefits of semantic knowledge models is their language independence. In principle we could integrate knowledge from a broad swath of publications in different languages, unify and normalize them to a single harmonious computer model, and then render them back out to users in whatever format (or language) they prefer. We'd love to find a way to test out this idea at some point soon.

Thanks again for your comments!

