Recognizing that much of the cell's work is done not by individual proteins but by large macromolecular complexes, researchers increasingly are trying to map protein-protein interactions throughout the cell. This map of the
If you want a sense of one of the hottest trends in biology today, open the hood of your car. At first, the jumble of boxes, wires, circuitry, and hoses that meets your eyes more likely will confuse than inform. Careful examination, however, exposes an intricate order, in which modular processes interact with each other to build ever-larger systems, culminating in a working automobile.
A cell presents an equally confusing array of parts. But like the car, the cell can be analyzed as a series of interacting systems working together to build a whole greater than its parts. "The entire cell can be viewed as a factory that contains an elaborate network of interlocking assembly lines, each of which is composed of a set of large protein machines," wrote National Academy of Sciences president Bruce Alberts in 1998.1
Over the last few years, researchers engaged in the nascent field of "interactomics" have been busy deconstructing these molecular machines, mapping protein-protein interactions in bacteria, yeast, fruit flies, nematodes, mice, and humans. The resulting charts have exposed the functions of previously mysterious proteins, have helped drug companies hone their development efforts, and are laying the foundations for systems biology.
But many researchers ask, 'If the studies are so comprehensive, then why isn't my interaction there?' Well, for starters, it's probably because investigators haven't gotten to it yet. Also, interactome mapping is still a primitive science. For all their colorful nodes and edges, these maps could well warn, "Here, there be dragons."
That's not to say they are without value, says Lewis Cantley, professor of systems biology at Harvard Medical School. By revealing potential functional linkages, they suggest molecular explanations for biological phenomena and provide avenues for follow-up. "The information is useful when viewed with an informed mind," he says.
Many of the principals involved in these projects compare the state of the field today with DNA sequencing technology in the late 1970s: immature, error-prone, yet filled with promise. But there's a key difference. In the early days of the sequencing revolution, both the software tools and the sequencing technology were crude. Today, high-throughput tools have upped the ante. Researchers are grappling with how best to visualize, validate, and interpret interaction data, even as it pours into electronic repositories.
CONNECT THE DOTS
Molecular cartographers who choose to take on the interactome should be patient, long-lived, or both, at least if they adhere to Marc Vidal's definition. The interactome, says the Dana-Farber Cancer Institute researcher, is a map of "all interactions that take place in an organism between all proteins, in all cells, all tissues, at all ages, and in response to all possible environmental conditions the organism sees in the wild."
Full interactome maps are dizzying, yet by zooming in, researchers gain insights into individual pathways. This network showcases a
All that mapping won't happen soon, considering the hundreds of thousands of proteins in each cell, and their many possible variants. Yet since 2000, scientists using yeast two-hybrid (Y2H) and affinity purification/mass spectrometry (AP/MS)-based approaches have made some headway, mapping networks in
The studies are remarkable for their scale. But their real power, says Frank Holstege, who heads the genomics laboratory at University Medical Center, Utrecht University, The Netherlands, lies in what they reveal about the underlying biology. "We're not interested in the interactions per se," he says. "We're interested in what the interactions tell us about the functions of proteins."
By that measure, these maps have in some sense fulfilled their purpose, offering peeks, through guilt by association, into unknown proteins' functions. Takashi Ito of the Human Genome Center, University of Tokyo, and colleagues used their yeast interactome map to place a previously uncharacterized protein, Ydr016c, in the pathway for spindle pole body formation, based upon its association with known members of that process.2 Independent investigators later verified that assessment.3
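The guilt-by-association reasoning described above can be sketched in a few lines of Python. This is a minimal illustration, not the method Ito's group used: the function name `predict_function`, the protein names, and the annotations are all hypothetical placeholders, and the majority-vote rule is an assumption chosen for simplicity.

```python
from collections import Counter

def predict_function(protein, interactions, annotations):
    """Guess a function from the most common annotation among known partners."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    # Only partners with a known annotation get a vote.
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None  # no annotated partners, so no prediction
    return votes.most_common(1)[0][0]

# Hypothetical toy data: orfX is uncharacterized, its partners are not.
interactions = [("orfX", "P1"), ("orfX", "P2"), ("P3", "orfX"), ("P1", "P2")]
annotations = {
    "P1": "spindle pole body",
    "P2": "spindle pole body",
    "P3": "translation",
}
guess = predict_function("orfX", interactions, annotations)
```

A prediction made this way is only a hypothesis for follow-up; as in the Ydr016c case, it still needs independent verification.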
Drug developers, too, are seeing benefits. CuraGen, a drug development company in New Haven, Conn., published Y2H-based maps of
And proteins often do that: Just as in human society, proteins sometimes participate in multiple complexes, notes Giulio Superti-Furga of Cellzome, a drug development firm in Heidelberg. "I have a group at work, I have a family at home, I have friends with whom I play soccer. It's the same me, but I do have a different behavior and a different function in these different settings."
Drugs targeting these so-called sticky proteins have the potential to produce troubling side effects. Pharmaceutical companies, therefore, are served better by focusing on those molecules that appear limited to whichever process the company is targeting, Chant says.
Yet doubts linger about the validity of interactome information. All agree that, though data-rich, these studies produce lower-quality results than would a scientist who devotes years to a single protein. "High-throughput means you do things fast," says Holstege. And if that happens, "You're going to make mistakes." The information, says Jeremy Gunawardena, director of Harvard's Virtual Cell Program, is "data rich and analysis poor."
Joel Bader, assistant professor of biomedical engineering, Johns Hopkins University, pegs the accuracy of high-throughput data at somewhere between 25% and 50%. "It's something people have known all along," he says. "The challenge has been to identify the good part from the bad part."
All of which makes it difficult for academic and pharma researchers alike to put too much stock in the data. Ultimately, the maps will have value, Gunawardena says. But for now, he advises would-be users, "
Y2H data especially is open to questioning, because the technique essentially amounts to a genetic trick: It relies on artificial fusion constructs, often mere fragments of the full-length protein, that are introduced into cells, forced into the nucleus, and overexpressed. False-positive results are inevitable.
But the AP/MS approach has problems too, notes Vidal, who uses Y2H in his work on
Understanding Network Topology
Albert-László Barabási, a theoretical physicist at the University of Notre Dame, and colleagues demonstrated in 2001 that protein-interaction networks are not uniformly connected, but are instead "scale-free." In other words, like the World Wide Web, they contain a continuum of node connectivity ranging from poorly connected proteins to highly connected "hubs."
This architecture effectively immunizes organisms from random mutational events, says Barabási. The vast majority, 93%, of yeast proteins make five or fewer connections, he showed in 2001, yet only one in five such proteins is essential.1 Deletion of highly connected proteins (those joining 15 or more proteins) is three times more likely to be lethal.
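For readers who want to see what "connectivity" means in practice, here is a rough Python sketch that counts each protein's partners in a list of pairwise interactions. The function names and the star-shaped toy network are assumptions made for illustration; only the thresholds (five or fewer links, 15 or more links) come from the figures quoted above.

```python
from collections import defaultdict

def degrees(interactions):
    """Count distinct partners for every protein in a list of pairs."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions for this count
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {protein: len(partners) for protein, partners in neighbors.items()}

def fraction_at_most(deg, k):
    """Fraction of proteins with k or fewer connections."""
    return sum(1 for d in deg.values() if d <= k) / len(deg)

def hubs(deg, min_links):
    """Proteins with at least min_links connections."""
    return sorted(p for p, d in deg.items() if d >= min_links)

# Hypothetical toy network: one hub "H" joined to 15 poorly connected proteins.
toy = [("H", f"s{i}") for i in range(15)]
deg = degrees(toy)
sparse_fraction = fraction_at_most(deg, 5)   # share with five or fewer links
highly_connected = hubs(deg, 15)             # proteins joining 15 or more
```

On a real dataset the same counts would give the degree distribution whose long tail is the signature of a scale-free network.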
Recently, Marc Vidal's team at the Dana-Farber Cancer Institute in Boston expanded this study by examining coexpression of genes linked by hubs. Their conclusion: All hubs are not created equal. Some, called "party hubs," contain proteins that interact simultaneously, whereas "date hubs" contain proteins that interact at different times and locations.2
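One way to picture the distinction is to average the expression correlation between a hub and each of its partners: high average coexpression suggests a party hub, low suggests a date hub. The sketch below assumes that reading. The 0.5 cutoff, the function names, and the expression profiles are illustrative assumptions, not values from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes nonzero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def classify_hub(hub, partners, expression, cutoff=0.5):
    """Label a hub by its average coexpression with its partners."""
    avg = sum(pearson(expression[hub], expression[p]) for p in partners) / len(partners)
    label = "party" if avg >= cutoff else "date"
    return label, avg

# Hypothetical expression profiles across four conditions.
expression = {
    "hubA": [1, 2, 3, 4], "a1": [2, 4, 6, 8], "a2": [1, 2, 3, 5],
    "hubB": [1, 2, 3, 4], "b1": [4, 3, 2, 1], "b2": [1, 2, 3, 4],
}
label_a, _ = classify_hub("hubA", ["a1", "a2"], expression)
label_b, _ = classify_hub("hubB", ["b1", "b2"], expression)
```

Here `hubA` rises and falls with both partners, while `hubB` tracks one partner and runs opposite to the other, so the two land on different sides of the assumed cutoff.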
The group infers a modular architecture for the proteome, in which date and party hubs operate at different organizational levels. Party hubs function to assemble individual molecular complexes, or modules. These modules, in turn, are linked at a higher level using date hubs. Thus, the date hub calmodulin links modules for cation homeostasis; budding, cell polarity, and filament formation; protein folding and stabilization; and the endoplasmic reticulum.
If these findings apply to metazoans, they will have both pure research and pharmaceutical implications. On the research front, an understanding of network topology and dynamics can help explain organismal responses to external stimuli. And drug developers, says Barabási, can use such information to hone their lists of potential drug targets.
- Jeffrey M. Perkel
Indeed, Y2H and AP/MS answer different questions, says Daniel Figeys, senior vice president of systems biology and lead profiling at Toronto-based MDS Proteomics, which uses AP/MS. Y2H experiments ask, 'Does protein A interact with protein B?' whereas AP/MS asks more broadly, 'Which proteins associate with protein A?' And though interactome-focused companies tend to favor one or the other, the approaches actually complement each other, says Sudhir Sahasrabudhe, chief scientific officer of Salt Lake City-based Prolexys Pharmaceutics, a company that uses both methods.
Researchers increasingly are trying to factor these concerns into their work. One approach verifies putative interactions by crosschecking other interactomics datasets. But this method seldom strikes paydirt, as experimental conditions can skew the results. A recent analysis of yeast interaction datasets by Bader and colleagues notes, "Only 387 interactions are common to the 6,395 Y2H and 41,775 Co-IP interactions from high-throughput data."7
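Crosschecking two datasets amounts to a set intersection, provided each interaction is treated as an unordered pair so that A-B and B-A count as the same thing. The following is a minimal sketch with hypothetical protein names; it also works out what the quoted figures imply, namely that 387 shared interactions is roughly 6% of the Y2H set and under 1% of the Co-IP set.

```python
def normalize(pairs):
    """Treat each interaction as an unordered pair so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}

def overlap(dataset_a, dataset_b):
    """Interactions reported in both datasets."""
    return normalize(dataset_a) & normalize(dataset_b)

# Hypothetical toy datasets.
y2h = [("A", "B"), ("C", "D"), ("E", "F")]
coip = [("B", "A"), ("C", "E")]
shared = overlap(y2h, coip)

# Fractions implied by the numbers quoted from Bader and colleagues.
y2h_fraction = 387 / 6395     # share of Y2H interactions also seen by Co-IP
coip_fraction = 387 / 41775   # share of Co-IP interactions also seen by Y2H
```

The sketch assumes both datasets use the same protein identifiers, which real repositories often do not.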
Others look for previously published interactions, or interactions in orthologous proteins in other species, what Vidal calls "interologs." A group led by Trey Ideker, assistant professor of bioengineering at University of California, San Diego, recently described a software tool called PATHBLAST that does exactly that.8 "What you're looking for basically is conservation based on proteins in different species that have not only similar sequences but also similar wiring in their protein-interaction networks," Ideker explains.
Some investigators try to attach a quality metric or statistical measure to their data. One simple approach is to score interactions based on the number of times they are observed. Others delve more deeply into mathematics to sift through the data. But Figeys cautions that statistical approaches can be problematic if good quality-control procedures were not in place during data collection. "It's basically garbage in, garbage out," he says.
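The simplest scoring scheme mentioned above, counting how often an interaction is observed, might look like this in Python. The function names, the toy observations, and the minimum of two observations are assumptions made for illustration.

```python
from collections import Counter

def observation_counts(observations):
    """Count how many times each unordered pair was observed."""
    return Counter(frozenset(pair) for pair in observations)

def confident(counts, min_observations=2):
    """Keep pairs seen at least min_observations times (threshold is an assumption)."""
    return {pair for pair, n in counts.items() if n >= min_observations}

# Hypothetical observations pooled from repeated screens.
observations = [("A", "B"), ("B", "A"), ("A", "B"), ("C", "D")]
counts = observation_counts(observations)
kept = confident(counts)
```

As Figeys's caution implies, a count like this says nothing about systematic errors: an artifact that recurs in every screen would score just as well as a real interaction.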
© 2002 Elsevier Science Ltd.
The interactome provides a natural scaffold to overlay genome-level datasets. Here, a germline interactome map was combined with a transcriptome map comprising 553 microarray experiments. (Reprinted with permission,
Providing some perspective, Cantley says the question of quality was raised 20 years ago when sequencing firms began dumping cDNA fragments into nucleotide databases. People warned against using these fragments because they had mistakes and were unverified and misleading, says Cantley. But instead, the fragments proved invaluable: "There may be errors, but they can be teased out by looking at enough data."
Some researchers have turned to a new, metascale functional genomics approach for data validation. Basically, the thinking goes, if two proteins interact, they should colocalize in the cell, their genes should be coexpressed, and when those genes are knocked out, the observed phenotypes should be similar. Explains Holstege, "You need to combine completely different sources of functional genomic data and see if you can come up with additional evidence that a particular interaction is correct."
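The logic of this approach can be sketched as a checklist: for a candidate pair, ask which independent lines of evidence agree. Everything in the example below is hypothetical, including the function name, the data, and the 0.5 correlation cutoff. Exact matching of compartments and phenotypes stands in for the more nuanced similarity measures a real analysis would need.

```python
def supporting_evidence(a, b, location, coexpression, phenotype, min_corr=0.5):
    """List which independent data sources are consistent with an a-b interaction."""
    pair = frozenset((a, b))
    checks = {
        # Same annotated compartment (missing annotations never count as support).
        "colocalized": a in location and location.get(a) == location.get(b),
        # Precomputed expression correlation at or above the assumed cutoff.
        "coexpressed": coexpression.get(pair, 0.0) >= min_corr,
        # Same knockout phenotype label.
        "similar_phenotype": a in phenotype and phenotype.get(a) == phenotype.get(b),
    }
    return [name for name, ok in checks.items() if ok]

# Hypothetical toy annotations.
location = {"A": "nucleus", "B": "nucleus", "C": "membrane"}
coexpression = {frozenset(("A", "B")): 0.8, frozenset(("A", "C")): 0.1}
phenotype = {"A": "slow growth", "B": "slow growth", "C": "no defect"}

ab_support = supporting_evidence("A", "B", location, coexpression, phenotype)
ac_support = supporting_evidence("A", "C", location, coexpression, phenotype)
```

Returning the list of agreeing sources rather than a yes-or-no verdict keeps the result in the spirit of the approach: more agreement raises confidence, but none of it proves the interaction.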
Vidal says the goal is to create an atlas of superimposable biological maps.9 The best scaffold on which to build that atlas, says Superti-Furga, is the interactome, which he compares to a corporate organizational chart. "You don't want to assemble them on a linear chromosome," he says; that would be like trying to understand a person's corporate role by the company's alphabetized directory.
In yeast, at least, the atlas is slowly coming together.
In metazoans, which feature introns, more genes, and inconsistent gene predictions, among other barriers, work is proceeding more slowly. Thanks to genome-wide Y2H and RNAi resources, for instance, Vidal's lab recently published an analysis of the nematode TGF-beta signaling network that merged physical interaction and double-genetic perturbation studies.10
Vidal and others say these overlays are merely suggestive, not absolute. That two interacting proteins are coexpressed is a positive sign, for instance, but it does not definitively say the two actually interact in vivo. Conversely, the absence of coexpression does not immediately indicate that the interaction is false. Correlations, says Holstege, only increase confidence. "It doesn't say with absolute certainty whether or not the interaction is true or false."
With so much disparate data available, the challenge for researchers is to integrate that information into a single interaction network and then to squeeze some useful biology out of it. That's where visualization tools such as Cytoscape come in.
Jointly developed by the Institute for Systems Biology, UC-San Diego, and the Memorial Sloan-Kettering Cancer Center in New York, Cytoscape incorporates interaction data with other large-scale genomic information to build networks of cellular processes and pathways. But via the Systems Biology Markup Language (SBML), it also can communicate with other SBML-enabled programs to test mathematical simulations of cellular events, for example.
Scientists debate whether such simulations can ever be sufficiently detailed to predict cellular behavior accurately. Roger Brent, president of the Molecular Sciences Institute in Berkeley, Calif., says they should eventually do that, at least at the pathway level. His institute's Alpha Project is an attempt to model quantitatively and predict the behavior of the yeast pheromone response.
But Ideker, a Cytoscape principal investigator, remains skeptical. "There's no way the data's anywhere near good enough, for one, and it doesn't have anywhere near the kind of quantitative and dynamic data associated with it to do that kind of modeling." Technologies are on the horizon that could overcome these barriers, but Ideker asks, "Who is to say if these will pan out?" Interactome maps display purely binary data – either an interaction exists or it does not – without regard to its timing, location, strength, direction, or consequence. So if proteins A and B form a complex, does A activate or repress B, or vice versa?
This detail of the
This, says Vidal, will be the next iteration of interactome mapping, and his team is developing approaches to address this question. For now, even academic researchers uninterested in protein-protein interaction networks could gain from these studies, says Superti-Furga, because the data are so rich. "The beautiful thing about the [cellular] wiring plan is it is not a point of arrival," he says. "It is the starting point for generating a lot of opportunities for additional experiments."
But it's also a work in progress. "We will keep generating them," says Vidal of interactome maps. "They are far from perfect, they are moving targets.... Interactions are working in so many different ways, at different times, and that makes them hard to characterize. But that's why we need maps in the first place."
In addition to
When all is said and done, all the protein interactions, high-confidence and low-confidence alike, must be stored and made accessible if they are to benefit the research community at large. But like the data they were built to archive, interaction databases are not homogeneous.
The major nucleotide databases, GenBank, EMBL, and DDBJ, share information regularly. The interaction databases, however, do not. Inconsistent file formats and curation priorities have produced a mish-mash of partly overlapping datasets.1 The Database of Interacting Proteins (DIP) catalogs some 44,000 interactions, IntAct archives nearly 28,000, and the three GRID (General Repository for Interaction Datasets) resources jointly hold nearly 50,000.
That inconsistency makes it difficult for researchers to gauge the completeness of interactome maps. To be truly comprehensive, they must download data from several databases, convert the data into some common format, and then process it.
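The download, convert, and process routine can be sketched as follows. The two input formats here (tab-separated and comma-separated pair lists) and the source labels `db_a` and `db_b` are invented for illustration and do not represent the actual export formats of DIP, IntAct, GRID, or BIND. The sketch also assumes all sources already use the same protein identifiers.

```python
import csv
import io

def parse(text, delimiter, source):
    """Yield (unordered pair, source) from a delimited list of interactions."""
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) >= 2:  # skip blank or malformed rows
            yield frozenset(cell.strip() for cell in row[:2]), source

def merge(*streams):
    """Combine parsed records, remembering which sources report each pair."""
    merged = {}
    for stream in streams:
        for pair, source in stream:
            merged.setdefault(pair, set()).add(source)
    return merged

# Hypothetical exports from two databases in different formats.
db_a_text = "A\tB\nC\tD\n"
db_b_text = "B,A\nE,F\n"

merged = merge(parse(db_a_text, "\t", "db_a"), parse(db_b_text, ",", "db_b"))
```

Keeping the set of sources for each pair makes the overlap between archives visible instead of hiding it behind a deduplicated list.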
And that doesn't even account for the massive historical backlog. "We estimate there can be on the order of 200,000 papers in the literature currently that have high-quality, but low-throughput, experiments about molecular interaction data," says Chris Hogue, principal investigator of the Blueprint Initiative at Mount Sinai Hospital, Toronto, which runs BIND, the Biomolecular Interaction Network Database.
BIND, like several competing databases, has a team that manually logs papers into the archive to supplement high-throughput data. Hogue expects that his staff of about 20 people can curate 80,000 of those papers within three years.
In February, the Human Proteome Organization's Proteomics Standards Initiative (HUPO-PSI) unveiled its Molecular Interaction data model, an XML-based standard that its developers hope will foster the exchange of protein-protein interaction data between databases.1 It also should make it easier for users to incorporate data from disparate archives, says Lukasz Salwinski, a research associate at the University of California, Los Angeles, who works on DIP.
- Jeffrey M. Perkel