If you want to know how biology will be practiced in the coming decades, check out a recent National Academy of Sciences colloquium on frontiers in bioinformatics.1 Assembling data is no longer the biggest challenge, says meeting cochair Russ B. Altman, citing sophisticated presentations on pseudogenes, RNA splicing, and molecular evolution. Instead, the major hurdle these days is one of data integration.
Biologists are beginning to form hypotheses that require the computational skills to mine 20 genomes at once, to compile and analyze data from multiple sources that were never intended to be mutually compatible. For now, it's an online game of cut-and-paste that favors younger, more computer-savvy biologists.
But bioinformaticians are working to smooth the process, so that scientists at the bench can worry less about how to get a piece of data, and more about what they want to do with it once they find it....
KNUCKLES AND NODES
In 2003, Lincoln D. Stein of Cold Spring Harbor Laboratory proposed a "knuckles and nodes" approach to the growing problem of data integration.4 The nodes would be focused databases, and the knuckles single-task services that enable scientists to link information from one node to another. So, for instance, a knuckle might keep track of orthology relationships, while each node would be a database for a particular species, holding that species' copy of the ortholog.
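The division of labor in Stein's proposal can be sketched in a few lines of Python. Everything here is illustrative: the gene names, databases, and function names are hypothetical stand-ins, not drawn from any real resource.

```python
# Illustrative sketch of the "knuckles and nodes" idea.
# All names and data are hypothetical, not from any real database.

# Nodes: focused, species-specific databases (here, trivial dicts).
worm_db = {"unc-22": "twitchin, a muscle protein"}
fly_db = {"sls": "sallimus, a titin-like muscle protein"}

# Knuckle: a single-task service that tracks only orthology links.
orthology_knuckle = {("worm", "unc-22"): ("fly", "sls")}

def lookup_ortholog(species, gene, nodes):
    """Follow the knuckle from one node to the matching entry in another."""
    target = orthology_knuckle.get((species, gene))
    if target is None:
        return None
    target_species, target_gene = target
    return target_gene, nodes[target_species][target_gene]

nodes = {"worm": worm_db, "fly": fly_db}
print(lookup_ortholog("worm", "unc-22", nodes))
# → ('sls', 'sallimus, a titin-like muscle protein')
```

The point of the architecture is that the knuckle knows nothing about sequences or annotations, and each node knows nothing about other species; integration happens only at the narrow link between them.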
The infrastructure emerging from the compendium of integration efforts is not a mirror image of Stein's proposal, but it does bear some resemblance. Two models are in the lead, says Stein, who developed Wormbase, the
Courtesy of Mark Wilkinson
In BioMOBY a client (user) can interact with multiple sources of biological data regardless of the underlying format or schema. In a typical query, servers register their services with MOBY Central. A client, with a piece of data in hand, queries MOBY Central for services that can accept that type of data as input. Once MOBY Central returns one or more such service descriptions, the client chooses one, submits its data to that service, and receives another form of data in return.
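The register-discover-invoke cycle described above can be sketched as a toy registry. The class and method names below are illustrative only; they are not the real MOBY Central API.

```python
# Toy registry/discovery loop in the spirit of BioMOBY's MOBY Central.
# Class, method, and type names are illustrative, not the real MOBY API.

class Registry:
    def __init__(self):
        self.services = []  # (name, input_type, output_type, fn)

    def register(self, name, input_type, output_type, fn):
        """A server registers its service with the central registry."""
        self.services.append((name, input_type, output_type, fn))

    def find(self, input_type):
        """Return descriptions of services that accept this data type."""
        return [s for s in self.services if s[1] == input_type]

central = Registry()
# A server registers a service that turns a gene ID into a sequence.
central.register("fetch_sequence", "GeneID", "DNASequence",
                 lambda gene_id: "ATGGCCAAGT")

# A client with a GeneID in hand asks the registry what it can do next.
matches = central.find("GeneID")
name, in_type, out_type, fn = matches[0]
result = fn("unc-22")  # submit the data, receive another form of data
```

The client never needs to know the service's underlying format or schema in advance; it only matches data types against the registry.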
Two new initiatives from Stanford could be examples of the latter. Prolinks
Two broad integration projects – BioMOBY
BEST OF BOTH WORLDS
MyGrid had created a tool called Taverna
Before the entire process can be completely automated, however, the two groups must solve the most difficult component: analysis of pooled information from databases set up with incompatible interfaces and different ontologies. "The data integration problem is something nobody has solved yet," notes Wilkinson.
The field is maturing out of necessity, says myGrid director Carole Goble of the University of Manchester. Pointing, clicking, cutting, and pasting are no longer efficient ways to integrate data from high-throughput searches across multiple databases, she says. "Imagine doing that two-and-a-half thousand times with the same sequence. That doesn't work unless you want to burn out every student in your lab."
MyGrid aims to serve bioinformaticians who have the expertise to conduct experiments in silico, Goble explains. Though some bench biologists also access the service, she warns that their not understanding the high-level middleware available to users could lead to bad science. "You can download our Taverna workbench and with one click install three hundred and fifty services," she says, referring to distributed tools. "... Now you've got to think about the experiment. If you're a pure bench biologist, you still need a bioinformatician to help you."
Omnigene
In March, meanwhile, the European Bioinformatics Institute in Cambridge, UK, launched BioMart
A WORK IN PROGRESS
The European Molecular Biology Organization's E-BioSci data network is more mature though still a work in progress. Based in Heidelberg, Germany, E-BioSci uses the Online Research Information Environment for the Life Sciences (ORIEL) project to develop tools with subcontractor Collexis, a software-development company in the Netherlands. These complementary projects, funded by the European Commission, aim to go where the two leading literature search engines have not, according to Les Grivell, who heads up E-BioSci and ORIEL.
"PubMed is a well-curated database. Google is not curated at all. It is a beautiful jungle that continues to grow," Grivell says. Neither empowers a scientist to search for a concept – what the E-BioSci system encodes numerically as a "fingerprint" – expressed in the scientific literature. Google's search engine typically runs a popularity contest, listing the most visited sites first, while burying those of more limited scientific interest. PubMed can search on key words in a concept but not put them together in a phrase such as "chemicals that induce tumors." A semantic Web forerunner, E-BioSci was conceived to do that, and if the first set of answers is off the mark, Grivell says the user could adjust the scoring system to give more weight to specific "currents that mean something biologically."
E-BioSci includes the ability to recognize aliases for genes that have different names in different species, and to conduct scientific literature searches in multiple languages. Currently, Grivell says, it can look for the same concept in English, French, and German (for which a proposed spelling reform is being closely watched). The capability to search Continental and Latin American variants of Spanish is under development. While E-BioSci's ultimate goal is to serve the broad biological community, skilled users have the edge for now. "Despite our efforts it probably will remain difficult for quite some time," Grivell concedes. "We are working not on gluing things together but communicating."
RECOMMENDATIONS AND CONCERNS
Courtesy of EnsEMBL.org
Lincoln Stein's DAS, or Distributed Annotation System, allows clients (like the EnsEMBL genome browser shown here) to retrieve genome annotations from multiple remote sources and display them in a unified format. In this screenshot, DAS annotations are displayed as browser "tracks."
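The retrieve-and-merge pattern the caption describes can be sketched as a minimal client. The sources, feature fields, and track names below are hypothetical; a real DAS client speaks the DAS protocol to remote servers rather than calling local functions.

```python
# Minimal sketch of a DAS-style client: pull annotations for one
# genomic region from several remote sources and merge them into
# unified display tracks. Sources and features are hypothetical.

def merge_tracks(region, sources):
    """Query each annotation source and group features by track name."""
    tracks = {}
    for source in sources:
        for feature in source(region):
            tracks.setdefault(feature["track"], []).append(feature)
    return tracks

# Two "remote" annotation servers, faked here as local functions.
genes = lambda region: [{"track": "genes", "start": 100, "end": 900}]
snps = lambda region: [{"track": "snps", "start": 450, "end": 451}]

unified = merge_tracks(("chrI", 1, 1000), [genes, snps])
# unified now holds one list of features per browser track
```

The browser itself stays simple: it only needs a common feature format, and any group can add a new annotation source without touching the others.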
So how can a bench scientist working in a wet lab keep up with the field today? Altman's recommendation: Don't wait for the big data integration projects to become user-friendly, or for your institution's bioinformatics support staff to get around to helping you. Senior scientists should send their fellows and graduate students to bioinformatics courses, or hire someone who has the skills to search across databases to find the information the lab needs. "If you don't have informatics in your lab, it is going to be hard to compete," he says, adding, "I don't think we should confuse whether it is easy for biologists to do something with whether it is possible today."
Richard Hyman of the Stanford Genome Technology Center (SGTC) also sees the databases "turning the way science [has been] done, since long before Newton, upside down." Hyman is concerned, however, that broad access can lead to data poaching before analyses can be completed and published by the laboratories that generated them. He raised the issue in a controversial letter to
As an example of unethical behavior, he offers the hypothetical case of a researcher who discovers 10 to 20 putative tyrosine kinases while searching a freely accessible, raw-sequence database. If this finding is published, a journal might subsequently refuse to publish the original group's more thorough analysis. "We say you have not done an experiment," Hyman says, arguing against publication of the tyrosine kinase search. The issue is not whether data should be posted online, but how it can be used. "Attention has to be paid to the ethical issues," he says. "... What can you do ethically [with] the data in the databases?"
Many scientists share Hyman's concern and withhold data that they are not obliged to share, says Altman. "The fear is these informatics guys with a lot of skills can go in very quickly and do very powerful things," Altman says. "Scientists are very nervous that they should be able to publish the principal findings from their data before they turn it loose." The counterbalance has been a "carrots-and-sticks" approach to prying information from lab benches, he says. The sticks are wielded by government grants and journals that require posting of data. Meanwhile, service providers offer carrots such as visualization capabilities that a lab might not otherwise have.
Stein also acknowledges the problem but urges a nuanced approach. Data in a database is free for any scientist to use in a published analysis, he says. The best course, however, is to alert the original experimenters and invite them to participate as collaborators. "Usually, you'll get the data and permission to use it. You may even get help with it," Stein says. "In the long run, you'll do bioinformatics much more good than trying to do it in a tricky and sneaky fashion."
In any event, not all of these problems are likely to be resolved any time soon. The price of success – more data posted, annotated, and integrated – is more work for users and service providers.
With so many disparate data sources and tools available, Grivell warns that users need to keep data trails as they work, if only to be able to understand years later how they reached their results. "What databases did you visit? What information did you get? How did you use it?" he says. "This is just the tip of the iceberg. With scientists generating huge amounts of data, you are going to have trouble keeping track."
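One low-tech way to keep the data trail Grivell describes is simply to log every database visit alongside what was retrieved, and to save that log with the analysis. The field names and entries below are illustrative.

```python
# A minimal data-trail log answering Grivell's three questions:
# what databases did you visit, what information did you get,
# and how did you use it? All entries here are hypothetical.
import json
import time

trail = []

def record(database, query, result_summary):
    """Append one timestamped entry to the data trail."""
    trail.append({
        "when": time.strftime("%Y-%m-%d %H:%M:%S"),
        "database": database,
        "query": query,
        "got": result_summary,
    })

record("GenBank", "similarity search with clone 17 sequence",
       "12 hits; kept top 3 accessions")
record("PDB", "structure lookup for top hit", "1 structure retrieved")

# Save the trail next to the results so the steps can be
# reconstructed years later.
with open("data_trail.json", "w") as fh:
    json.dump(trail, fh, indent=2)
```

Even this much makes it possible to answer, years later, which databases were visited and in what order the results were assembled.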
For now, however, bioinformatics skills can give ambitious junior scientists an edge over senior scientists at the top of the field, says Altman. "I've met many biologists who are really frozen. They know what's in the databases. They don't have the skills to get to it, and they don't have anybody in the lab who does," he says.
For scientists building integrated systems, there's a sense of running in place, says Stein. "It's like painting the bridge," he says. "When you get to the other side, you have to start all over."