Coding a Bridge Across the Data Divide


December 20, 2004


If you want to know how biology will be practiced in the coming decades, check out a recent National Academy of Sciences colloquium on frontiers in bioinformatics.1 Assembling data is no longer the biggest challenge, says meeting cochair Russ B. Altman, citing sophisticated presentations on pseudogenes, RNA splicing, and molecular evolution. Instead, the major hurdle these days is one of data integration.

Biologists are beginning to form hypotheses that require the computational skills to mine 20 genomes at once, and to compile and analyze data from multiple sources that were never intended to be mutually compatible. For now, it's an online game of cut-and-paste that favors younger, more computer-savvy biologists.

But bioinformaticians are working to smooth the process, so that scientists at the bench can worry less about how to get a piece of data, and more about what they want to do with it once they find it.

It is a two-pronged problem. First, data repositories and databases hold vast stores of information, but are largely unable to exchange this information with related resources. Second, many bioinformatics tools have unique interfaces and output formats, potentially limiting a user's ability to shunt the data into other programs for further analysis. In both cases, users are forced to navigate online resources with different purposes, terminology, concepts, interfaces, query languages, formats, annotation systems, and architectures.

A growing number of bioinformatics services are striving to build bridges to span these divides. And as if making use of the huge amount of data already online were not daunting enough for the untutored, changes are in the works. The ultimate goal for some data service providers and bioinformaticians is a new "semantic Web" to address the complexity of life science research. Instead of searching for key words, biologists would be able to probe multiple databases for key concepts – for example, "proteins that upregulate gene X" – and run automated analyses that update themselves with changes in the online literature.

This effort isn't limited to scattered independent groups. The World Wide Web Consortium (W3C), a nonprofit organization that sets technical standards such as HTML for the Web, is interested in playing a role. In October, John Wilbanks, a W3C fellow, gathered players in Boston to compare notes.2 "Something I didn't expect, and it's been very gratifying to see, is how many people are involved in making it easy for scientists to discover services in a meaningful way," he said immediately afterward, as his organization prepared to assess what its contribution might be.

Those most engaged in these efforts say they have made significant progress toward easing data integration. The frontrunners are largely service providers – a mixed brew of database managers, computer scientists, and biology-trained bioinformaticians. But except for computer-savvy biologists, word has not trickled down to many end users at the laboratory bench. Some tools are in use, but, like new gizmos under the hood of a car, they are not necessarily apparent when you start up the engine. Virtually all are works in progress.

Indeed, in her review of techniques for query optimization, Zoé Lacroix of Arizona State University in Tempe concludes, "the evaluation of complex queries on biological data sources, capturing the diverse search, query processing, and domain-specific computational capabilities, raises several challenges yet to be addressed."3


In 2003, Lincoln D. Stein of Cold Spring Harbor Laboratory proposed a "knuckles and nodes" approach to the growing problem of data integration.4 The nodes would be focused databases, and the knuckles single-task services that enable scientists to link information from one node to another. So, for instance, a knuckle might keep track of orthology relationships while each node would be a database for a species with a specific ortholog.
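The division of labor Stein describes can be pictured with a small sketch. This is only an illustration under stated assumptions, not Stein's design or any real service: the class names `Node` and `OrthologyKnuckle`, the function `follow_orthologs`, and all gene identifiers and annotations are hypothetical and held in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A focused database for one species (hypothetical, in-memory)."""
    species: str
    genes: dict = field(default_factory=dict)  # gene id -> annotation text

    def lookup(self, gene_id):
        return self.genes.get(gene_id)


@dataclass
class OrthologyKnuckle:
    """A single-task service: it only tracks orthology links between nodes."""
    links: dict = field(default_factory=dict)  # (species, gene) -> list of (species, gene)

    def add_pair(self, a, b):
        # Orthology is symmetric, so store the link in both directions.
        self.links.setdefault(a, []).append(b)
        self.links.setdefault(b, []).append(a)

    def orthologs(self, species, gene_id):
        return list(self.links.get((species, gene_id), []))


def follow_orthologs(knuckle, nodes, species, gene_id):
    """Hop from one node to the others via the knuckle and collect annotations."""
    results = {}
    for other_species, other_gene in knuckle.orthologs(species, gene_id):
        node = nodes.get(other_species)
        if node is not None:
            results[(other_species, other_gene)] = node.lookup(other_gene)
    return results


# Made-up example data, for illustration only.
worm = Node("worm", {"geneW1": "example worm annotation"})
fly = Node("fly", {"geneF1": "example fly annotation"})
nodes = {"worm": worm, "fly": fly}

knuckle = OrthologyKnuckle()
knuckle.add_pair(("worm", "geneW1"), ("fly", "geneF1"))

hits = follow_orthologs(knuckle, nodes, "worm", "geneW1")
print(hits)
```

The point of the split is that neither node needs to know about the other; only the knuckle holds the cross-species link.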

The infrastructure emerging from the compendium of integration efforts is not a mirror image of Stein's proposal, but it does bear some resemblance. Two models are in the lead, says Stein, who developed WormBase, the Caenorhabditis elegans model organism database. The first model comprises large, broad database services that have little depth and act as go-betweens linking smaller databases. The second model covers less ground but explores a particular area of research in greater detail.


Courtesy of Mark Wilkinson

In BioMOBY a client (user) can interact with multiple sources of biological data regardless of the underlying format or schema. In a typical query, servers register their services with MOBY Central. A client, with a piece of data in hand, queries MOBY Central for services that can accept that type of data as input. Once MOBY Central returns one or more such service descriptions, the client chooses one, submits its data to that service, and receives another form of data in return.
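The register, discover, and invoke cycle in the caption above can be sketched in a few lines. This is a toy stand-in, not the BioMOBY protocol or the MOBY Central interface: the `Registry` class, the data-type labels "DNASequence" and "Fraction", and the two example services are all assumptions made for illustration, and everything runs in one process rather than over the Web.

```python
class Registry:
    """Stand-in for a central registry; not the real MOBY Central interface."""

    def __init__(self):
        self._services = []

    def register(self, name, input_type, output_type, func):
        # A service provider announces what it accepts and what it returns.
        self._services.append(
            {"name": name, "input": input_type, "output": output_type, "run": func}
        )

    def find_services(self, input_type):
        # Return descriptions of every service that accepts this data type.
        return [s for s in self._services if s["input"] == input_type]


def reverse_complement(seq):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


# Step 1: services register with the central registry.
central = Registry()
central.register("revcomp", "DNASequence", "DNASequence", reverse_complement)
central.register("gc", "DNASequence", "Fraction", gc_content)

# Step 2: a client with data in hand asks what can accept that type.
data_type, data = "DNASequence", "ATGC"
matches = central.find_services(data_type)

# Step 3: the client picks one service, submits its data, gets new data back.
chosen = next(s for s in matches if s["name"] == "revcomp")
result = chosen["run"](data)
print([s["name"] for s in matches], result)
```

Because services are matched on the type of data they accept rather than on their names, a client does not need to know in advance which services exist.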

Two new initiatives from Stanford could be examples of the second model. Prolinks shows functional linkages of proteins in 168 organisms. The Pharmacogenomics Knowledge Base matches genotype and phenotype information for research into how genomic variations lead to variations in response to drugs.

Two broad integration projects – BioMOBY funded by Genome Canada, and myGrid backed by the UK's Engineering and Physical Sciences Research Council – provide sophisticated systems where scientists can browse, identify, and access multiple bioinformatics services including databases. In October, these groups looked at the progress each had made in software development and agreed to merge the best of their tools in a unified registry system supporting both projects.


MyGrid had created a tool called Taverna for selecting databases to use in a search, explains BioMOBY project leader Mark Wilkinson of the University of British Columbia's iCAPTURE Centre in Vancouver. BioMOBY had created extremely good tools for executing the ensuing searches, he says. "To the end users, this should all be invisible, and they will simply get the data they want without knowing which system they are using."

Before the entire process can be completely automated, however, the two groups must solve the most difficult component: analysis of pooled information from databases set up with incompatible interfaces and different ontologies. "The data integration problem is something nobody has solved yet," notes Wilkinson.

The field is maturing out of necessity, says myGrid director Carole Goble of the University of Manchester. Pointing, clicking, cutting, and pasting are no longer efficient ways to integrate data from high-throughput searches across multiple databases, she says. "Imagine doing that two-and-a-half thousand times with the same sequence. That doesn't work unless you want to burn out every student in your lab."

MyGrid aims to serve bioinformaticians who have the expertise to conduct experiments in silico, Goble explains. Though some bench biologists also access the service, she warns that a poor understanding of the high-level middleware available to users could lead to bad science. "You can download our Taverna workbench and with one click install three hundred and fifty services," she says, referring to distributed tools. "... Now you've got to think about the experiment. If you're a pure bench biologist, you still need a bioinformatician to help you."

Omnigene, another project aimed at standardizing biological data exchange on the Web, has gone into limbo since its head, Brian Gilman, left the Whitehead Institute in Cambridge, Mass., to start his own consulting firm, Panther Informatics. Some of its open-source technology is being absorbed into the National Cancer Institute's cancer bioinformatics infrastructure objects (caBIO) model, according to Gilman.

In March, meanwhile, the European Bioinformatics Institute in Cambridge, UK, launched BioMart, a "distributed-data warehouse" being built for scientists who want to conduct their own searches. "BioMart is trying to cut out the middleman between the researcher and the Web databases," says project head Arek Kasprzyk. Biologists will need to learn only one Web interface to query all the databases accessible via the emerging platform, says Kasprzyk, who describes BioMart as "a one-stop shop for scalable querying of biological data, or a set of universal query interfaces, which can be applied to any database." BioMart's site says it has already been applied to UniProt Proteomes, Macromolecular Structure Database, Ensembl, Vega, and dbSNP.


The European Molecular Biology Organization's E-BioSci data network is more mature though still a work in progress. Based in Heidelberg, Germany, E-BioSci uses the Online Research Information Environment for the Life Sciences (ORIEL) project to develop tools with subcontractor Collexis, a software-development company in the Netherlands. These complementary projects, funded by the European Commission, aim to go where the two leading literature search engines have not, according to Les Grivell, who heads up E-BioSci and ORIEL.

"PubMed is a well-curated database. Google is not curated at all. It is a beautiful jungle that continues to grow," Grivell says. Neither empowers a scientist to search for a concept – what the E-BioSci system encodes numerically as a "fingerprint" – expressed in the scientific literature. Google's search engine typically runs a popularity contest, listing the most visited sites first, while burying those of more limited scientific interest. PubMed can search on key words in a concept but not put them together in a phrase such as "chemicals that induce tumors." A semantic Web forerunner, E-BioSci was conceived to do that, and if the first set of answers is off the mark, Grivell says the user could adjust the scoring system to give more weight to specific "currents that mean something biologically."
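The idea of a numerical concept profile with adjustable weights can be illustrated with a toy example. This is not E-BioSci's or Collexis's actual fingerprint method, which the article does not describe; the functions `fingerprint` and `score`, the concept vocabulary, and the two sample documents are hypothetical, and simple word counting stands in for real concept recognition.

```python
def fingerprint(text, concept_terms):
    """Count how often each concept's terms appear in the text."""
    words = text.lower().split()
    return {
        concept: sum(words.count(term) for term in terms)
        for concept, terms in concept_terms.items()
    }


def score(doc_fingerprint, query_weights):
    """Weighted sum of concept counts; a larger weight means the concept matters more."""
    return sum(
        weight * doc_fingerprint.get(concept, 0)
        for concept, weight in query_weights.items()
    )


# Hypothetical concept vocabulary for the query "chemicals that induce tumors".
concept_terms = {
    "chemical": ["chemical", "compound"],
    "induction": ["induce", "induces"],
    "tumor": ["tumor", "tumors"],
}

# Made-up documents, for illustration only.
docs = {
    "doc_a": "the compound induces tumors in mice",
    "doc_b": "tumor tumors tumor imaging methods",
}
prints = {name: fingerprint(text, concept_terms) for name, text in docs.items()}

# With equal weights, a document that only repeats "tumor" scores as well
# as one that covers all three concepts.
equal = {"chemical": 1, "induction": 1, "tumor": 1}
equal_scores = {name: score(fp, equal) for name, fp in prints.items()}

# The user then gives more weight to the concepts that matter biologically.
adjusted = {"chemical": 2, "induction": 2, "tumor": 1}
adjusted_scores = {name: score(fp, adjusted) for name, fp in prints.items()}

print(equal_scores, adjusted_scores)
```

Reweighting separates the document that actually links a chemical to tumor induction from the one that merely mentions tumors often, which is the kind of adjustment Grivell describes.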

E-BioSci includes the ability to recognize aliases for genes that have different names in different species, and to conduct scientific literature searches in multiple languages. Currently, Grivell says, it can look for the same concept in English, French, and German (for which a proposed spelling reform is being closely watched). The capability to search Continental and Latin variations of Spanish is under development. While E-BioSci's ultimate goal is to serve the broad biological community, skilled users have the edge for now. "Despite our efforts it probably will remain difficult for quite some time," Grivell concedes. "We are working not on gluing things together but communicating."




Lincoln Stein's DAS, or Distributed Annotation System, allows clients (like the Ensembl genome browser shown here) to retrieve genome annotations from multiple remote sources and display them in a unified format. In this screenshot, DAS annotations are displayed as browser "tracks."
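What a client does with annotations from several sources can be sketched as follows. This is a minimal illustration, not the DAS protocol or its message format: the function `build_tracks`, the source names, and the feature coordinates are hypothetical, and plain dictionaries stand in for what would be responses fetched from remote servers.

```python
def build_tracks(sources, region_start, region_end):
    """Group features from each source into one display track per source."""
    tracks = {}
    for source_name, features in sources.items():
        # Keep only features that overlap the region being viewed.
        visible = [
            f for f in features
            if f["end"] >= region_start and f["start"] <= region_end
        ]
        # Order each track left to right along the genome.
        tracks[source_name] = sorted(visible, key=lambda f: f["start"])
    return tracks


# Made-up annotations from two independent (hypothetical) sources.
sources = {
    "lab_a_genes": [
        {"start": 500, "end": 900, "label": "gene2"},
        {"start": 100, "end": 300, "label": "gene1"},
    ],
    "lab_b_repeats": [
        {"start": 250, "end": 260, "label": "repeat1"},
        {"start": 2000, "end": 2100, "label": "repeat2"},
    ],
}

tracks = build_tracks(sources, 0, 1000)
for name, features in tracks.items():
    print(name, [f["label"] for f in features])
```

Each source keeps control of its own annotations; the client only needs a shared coordinate system and a common feature shape to lay them out side by side.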

So how can a bench scientist working in a wet lab keep up with the field today? Altman's recommendation: Don't wait for the big data integration projects to become user-friendly, or for your institution's bioinformatics support staff to get around to helping you. Senior scientists should send their fellows and graduate students to bioinformatics courses, or hire someone who has the skills to search across databases to find the information the lab needs. "If you don't have informatics in your lab, it is going to be hard to compete," he says, adding, "I don't think we should confuse whether it is easy for biologists to do something with whether it is possible today."

Richard Hyman of the Stanford Genome Technology Center (SGTC) also sees the databases "turning the way science [has been] done, since long before Newton, upside down." Hyman is concerned, however, that broad access can lead to data poaching before analyses can be completed and published by the laboratories that generated them. He raised the issue in a controversial letter to Science,5 and his group has posted a data-release policy on the SGTC Web site.6 (See also an opinion by Hyman on p. 8.) The center releases raw information in compliance with government grant requirements, Hyman explains. In turn, it expects other scientists to allow the original investigators to have a complete first pass at their own work.

As an example of unethical behavior, he offers the hypothetical case of a researcher who discovers 10 to 20 putative tyrosine kinases while searching a freely accessible, raw-sequence database. If this finding is published, a journal might subsequently refuse to publish the original group's more thorough analysis. "We say you have not done an experiment," Hyman says, arguing against publication of the tyrosine kinase search. The issue is not whether data should be posted online, but how it can be used. "Attention has to be paid to the ethical issues," he says. "... What can you do ethically [with] the data in the databases?"

Many scientists share Hyman's concern and withhold data that they are not obliged to share, says Altman. "The fear is these informatics guys with a lot of skills can go in very quickly and do very powerful things," Altman says. "Scientists are very nervous that they should be able to publish the principal findings from their data before they turn it loose." The counterbalance has been a "carrots-and-sticks" approach to prying information from lab benches, he says. The sticks are wielded by government grants and journals that require posting of data. Meanwhile, service providers offer carrots such as visualization capabilities that a lab might not otherwise have.

Stein also acknowledges the problem but urges a nuanced approach. Data in a database is free for any scientist to use in a published analysis, he says. The best course, however, is to alert the original experimenters and invite them to participate as collaborators. "Usually, you'll get the data and permission to use it. You may even get help with it," Stein says. "In the long run, you'll do bioinformatics much more good than trying to do it in a tricky and sneaky fashion."

In any event, the problems are unlikely to be resolved any time soon. The price of success – more data posted, annotated, and integrated – is more work for users and service providers.

With so many disparate data sources and tools available, Grivell warns that users need to keep data trails as they work, if only to be able to understand years later how they reached their results. "What databases did you visit? What information did you get? How did you use it?" he says. "This is just the tip of the iceberg. With scientists generating huge amounts of data, you are going to have trouble keeping track."
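Grivell's three questions map naturally onto a simple log. The sketch below is one possible way to keep such a trail, assuming nothing about any particular tool: the `DataTrail` class, its field names, and the example database names and queries are all hypothetical.

```python
import json
from datetime import datetime, timezone


class DataTrail:
    """Minimal record of which sources were consulted and why (hypothetical)."""

    def __init__(self):
        self.entries = []

    def record(self, database, query, result_summary, used_for):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "database": database,              # What databases did you visit?
            "query": query,
            "result_summary": result_summary,  # What information did you get?
            "used_for": used_for,              # How did you use it?
        }
        self.entries.append(entry)
        return entry

    def to_json(self):
        # Serialize the trail so it can be stored alongside the results.
        return json.dumps(self.entries, indent=2)


# Made-up entries, for illustration only.
trail = DataTrail()
trail.record(
    database="example_sequence_db",
    query="similarity search with sequence S1",
    result_summary="12 candidate homologs",
    used_for="chose 3 candidates for follow-up",
)
trail.record(
    database="example_literature_db",
    query="papers mentioning the 3 candidates",
    result_summary="5 relevant papers",
    used_for="background for experiment design",
)

print(trail.to_json())
```

Saving the JSON output next to the analysis results keeps the trail readable years later without depending on any one program.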

For now, however, bioinformatics skills can give ambitious junior scientists an edge over senior scientists at the top of the field, says Altman. "I've met many biologists who are really frozen. They know what's in the databases. They don't have the skills to get to it, and they don't have anybody in the lab who does," he says.

For scientists building integrated systems, there's a sense of running in place, says Stein. "It's like painting the bridge," he says. "When you get to the other side, you have to start all over."
