How to make sense of proliferating proteomics data…even if you’re not a proteomicist.
Pity the poor protein biologist. DNA sequence gurus have GenBank, structural biologists, PDB. But those looking to data-mine the spectral peaks and valleys of today’s burgeoning proteomics literature are out of luck.
Or are they?
Several freely available databases are dedicated to the storage, annotation, and analysis of mass spectrometric proteomics data. Yet because they are both poorly advertised and sparsely populated, mining them to feed the sorts of meta-analyses that have become staples of gene sequence and gene expression studies largely has not been possible.
“We’re at the stage now in proteomics where genome sequencing was maybe 10 years ago or so,” says Conrad Bessant, a bioinformatics group leader at Cranfield University, United Kingdom, who coauthored a 2009 paper on proteomics databases.1 “We’re starting to get good-quality data into the databases, with data standards to share data. But I don’t see many pieces of work where people are using these databases in the way they use, for example, gene sequence databases.”
To some extent, Bessant attributes that to database maturity. The structural biology database PDB has been around since 1971; most proteomics databases are 21st century creations. “It’s a relative novelty, I suppose, and until people actually understand what’s in there and start to do some stuff, these [meta-analyses] are not done.”
To begin to rectify that deficiency, The Scientist decided to find out just exactly what is “in there” that everyday biologists can use.
If you’re going to dig into the proteomics dataverse, be prepared to spread the work around. Bessant’s 2009 paper lists more than a dozen repositories with different coverage and capabilities. Bessant recommends PRIDE (http://www.ebi.ac.uk/pride/), PeptideAtlas (http://www.peptideatlas.org/) and gpmDB (http://gpmdb.thegpm.org), which he says are “the most well-developed and the ones with the largest amount of data.” If you want the raw spectral data itself, the proteomics equivalent of the sequencer traces behind a sequence database entry, try Tranche (http://tranche.proteomecommons.org/).
But be advised: You’re much more likely to find a given dataset if it’s recent. Though stuffed with information—gpmDB holds over 166 million peptide identifications—these databases are not as retrospective as their nucleic acid and structural counterparts. Until recently, journals did not require the public deposition of proteomics datasets, and in any event, heterogeneous data formats complicated archiving and analysis and precluded easy dataset comparisons.
Be that as it may, the simplest query is a search by protein name or identifier. PeptideAtlas, PRIDE, and gpmDB all support searches using accession numbers from a variety of protein databases including ENSEMBL, UNIPROT, and RefSeq, as well as keywords.
Suppose you’re interested in the eukaryotic signaling protein ABL1. Enter the protein’s name in the search box on the gpmDB home page and you’ll find eight entries, one for each variant in the database. The first entry, ENSP00000323315, is the most widely seen variant, having been observed in 374 separate datasets. Selecting that link brings up its 20 best experimental observations, including maps showing the peptide fragments that were experimentally detected in each case. Clicking the map brings up a detailed report showing the precise sequence of the peptides detected, their mass, posttranslational modifications, spectral details, and so on.
Other searches are also possible. PRIDE offers a dataset browser that enables users to home in on datasets via organism, tissue, cell type, gene ontology term, or disease of interest. And PeptideAtlas can search by intracellular pathway. Only PRIDE enables searching by published reference, but gpmDB and PeptideAtlas users have a workaround: keyword searches.
Unfortunately, such searches are hit-or-miss. One recent dataset, for instance, includes the proteomics data associated with a newly constructed yeast kinase/phosphatase interaction network.2 Searching gpmDB for one of that study’s authors, Anne-Claude Gingras, will retrieve the study. But a search of PRIDE for the same dataset comes up empty.
Proteomicists identify proteins based on the mass spectral properties of their constituent proteolytic peptides and the masses obtained by shattering those peptides into smaller chunks. By crunching the resulting numbers, software packages can assign the observed spectral peaks to the proteins that made them.
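For readers who want to see the arithmetic behind that matching, here is a minimal Python sketch of its first step: cutting a protein into peptides and computing each peptide's mass. It assumes the protease is trypsin (cleaving after lysine or arginine unless proline follows, a common simplification) and uses standard monoisotopic residue masses; the toy sequence and function names are illustrative and not taken from any of the databases discussed here.

```python
import re

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added to the summed residues of a free peptide


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (simplified trypsin rule).

    Relies on re.split handling zero-width matches (Python 3.7 or later).
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Toy sequence, purely illustrative (not a real protein entry).
toy_protein = "MKWVTFISLLRPAGKDE"
for pep in tryptic_digest(toy_protein):
    print(f"{pep:15s} {peptide_mass(pep):10.4f}")
```

Identification software compares masses like these, along with fragment masses, against the observed peaks; real search engines also allow for missed cleavages and modifications, which this sketch ignores.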
But how confident can biologists be that the reported protein really existed in the analyzed sample and that the acquired spectrum truly matched the identified protein? Probably the best prognosticator of that is the false discovery rate (FDR), a figure usually reported in the published literature that reflects the quality of the entire dataset.
“Any dataset deposited and made accessible should have an attribute of its quality and false discovery rate, so the biologist can make a judgment,” says Ruedi Aebersold, professor of molecular systems biology at the Swiss Federal Institute of Technology, who insists that an accurate FDR is indispensable when assessing any proteomics dataset. Typically, datasets with no more than a 1% FDR (that is, datasets in which no more than about 1% of the reported identifications are expected to be false) are used, he says. “Otherwise, it makes no sense to do the analysis.”
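How such a figure might be estimated can be sketched in a few lines of Python. The example below assumes the widely used target-decoy approach, in which spectra are also searched against deliberately nonsensical "decoy" sequences and the decoy hits that pass a score cutoff stand in for the false positives; the article's sources do not say which method any given dataset used, and the scores shown are made up.

```python
def estimate_fdr(matches, threshold):
    """Estimate the FDR among matches scoring at or above `threshold`.

    `matches` is a list of (score, is_decoy) pairs; higher scores are better.
    The estimate is decoy hits divided by target hits (target-decoy assumption).
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Made-up scores for illustration only.
matches = [(50, False), (45, False), (40, False), (38, True), (35, False), (30, True)]

print(estimate_fdr(matches, 35))  # 4 targets, 1 decoy -> 0.25
print(estimate_fdr(matches, 40))  # 3 targets, 0 decoys -> 0.0
```

Note that the result describes the whole list of accepted matches, not any single one of them.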
Both PRIDE and gpmDB list FDR values for their datasets. But according to gpmDB head Ron Beavis, that value is not particularly informative when evaluating individual proteins. “FDR only gives an estimate as to how many peptide sequence assignments may be caused by random matches in the entire dataset; it doesn’t tell you anything about the quality of an individual match.”
Instead, Beavis recommends gpmDB’s log(e) value. The gpmDB web site defines the log(e), or expectation score, as “the base-10 log of the expectation that any particular peptide assignment was made at random.” The more negative the log(e), and thus the smaller the underlying expectation (the so-called E-value), the better.
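To make the scale concrete: an expectation of 0.001 corresponds to a log(e) of -3, and each further unit down means a tenfold smaller chance of a random match. The short Python sketch below converts an expectation to its log(e) and ranks a handful of matches best-first. The scores and identifiers are hypothetical, and the helper names are not part of any gpmDB interface.

```python
import math


def log_e(expectation):
    """Base-10 log of an expectation value."""
    return math.log10(expectation)


def rank_matches(scores):
    """Return match identifiers sorted best-first (most negative log(e) first)."""
    return sorted(scores, key=scores.get)


# Hypothetical log(e) scores, not real gpmDB results.
example = {"match_A": -1.2, "match_B": -45.0, "match_C": -8.7}

print(log_e(1e-3))
print(rank_matches(example))
```

Working with the logs directly is also practical: an expectation as small as 10 to the power of -772 cannot be represented as an ordinary floating-point number, but its log(e) can.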
Users may also get a quick-and-dirty perspective by observing how many different peptides from a protein are detected. If only a single peptide is detected in a long protein, view the assignment with skepticism.
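That rule of thumb is easy to apply to a downloaded list of identifications. The following Python sketch counts the distinct peptides supporting each protein and flags those resting on a single peptide. It assumes the identifications are available as simple (protein, peptide) pairs, which is an illustrative format rather than any database's actual export, and the names in the example are placeholders.

```python
from collections import defaultdict


def peptides_per_protein(identifications):
    """Count distinct peptides observed for each protein."""
    by_protein = defaultdict(set)
    for protein, peptide in identifications:
        by_protein[protein].add(peptide)
    return {protein: len(peps) for protein, peps in by_protein.items()}


def single_peptide_hits(identifications):
    """Proteins supported by only one distinct peptide, sorted by name."""
    counts = peptides_per_protein(identifications)
    return sorted(protein for protein, n in counts.items() if n == 1)


# Placeholder identifications for illustration only.
ids = [
    ("PROT_1", "AAK"), ("PROT_1", "GGR"), ("PROT_1", "AAK"),
    ("PROT_2", "LLK"),
]
print(peptides_per_protein(ids))
print(single_peptide_hits(ids))
```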
At the moment, most databased proteomics information is qualitative, not quantitative. Unlike gene expression data, which can be used to probe how much a given mRNA’s abundance has changed between tissues or treatment conditions, proteomics data is essentially binary: either a protein is in the sample, or it isn’t. “These are like lists of what people have seen, effectively,” says Bessant.
Still, those lists can be compared to see how they overlap. In PRIDE, a search for the UNIPROT accession number P29375 retrieves 10 experiments in which a particular protein was found. Select any two or more and check “compare experiments,” and the system will produce a Venn diagram indicating which proteins—in addition to the one you searched—are found in only one dataset, and which are found in more than one. Clicking on the appropriate region of the diagram calls up the corresponding protein list, which may be used, for instance, to identify candidate biomarkers.
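The same comparison can be reproduced offline once the protein lists are in hand, since the regions of a two-set Venn diagram are just set differences and an intersection. Here is a minimal Python sketch; apart from P29375, the accession labels are placeholders, and the two lists are invented for illustration.

```python
def venn_regions(experiment_a, experiment_b):
    """Split two protein lists into the three regions of a two-set Venn diagram."""
    a, b = set(experiment_a), set(experiment_b)
    return {
        "only_a": a - b,   # seen only in the first experiment
        "shared": a & b,   # seen in both
        "only_b": b - a,   # seen only in the second experiment
    }


# Invented protein lists; only P29375 comes from the text above.
exp_a = ["P29375", "ACC_001", "ACC_002"]
exp_b = ["P29375", "ACC_002", "ACC_003"]

regions = venn_regions(exp_a, exp_b)
for name, proteins in regions.items():
    print(name, sorted(proteins))
```

Proteins that appear in only one region, for example only in a disease sample, are the kind of candidates the text describes.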
For those less interested in the technical minutiae of protein identification and more interested in protein function, interaction datasets can be quite revealing.
Archived in such databases as the BioGRID (http://thebiogrid.org/) and IntAct (http://www.ebi.ac.uk/intact/), protein interactions can suggest function by molecular “guilt by association.” Yet, as with primary proteomics data, there is no single repository from which to launch a search. The smart money, says Aebersold, is to query more than one, “because they usually have a different slant or different content.”
A search of the BioGRID for ABL1 pulls up 68 unique interactors for humans and three for mice, compared to 187 total in IntAct. Both conveniently list the experimental method used to identify each interaction, an indication of whether the study employed high-throughput or low-throughput methods (the latter presumably being more reliable), and a link to the relevant publications.
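If you do query more than one interaction database, it helps to keep track of which source reported which partner. The Python sketch below merges such results and picks out the partners every source agrees on. It assumes the results have already been exported as (partner, database) pairs; the partner names are placeholders rather than actual ABL1 interactors.

```python
from collections import defaultdict


def merge_interactors(records):
    """Map each interaction partner to the set of databases that report it."""
    merged = defaultdict(set)
    for partner, database in records:
        merged[partner].add(database)
    return dict(merged)


def reported_by_all(merged, databases):
    """Partners reported by every database in `databases`, sorted by name."""
    wanted = set(databases)
    return sorted(p for p, sources in merged.items() if wanted <= sources)


# Placeholder records for illustration only.
records = [
    ("PARTNER_1", "BioGRID"), ("PARTNER_1", "IntAct"),
    ("PARTNER_2", "BioGRID"),
    ("PARTNER_3", "IntAct"),
]
merged = merge_interactors(records)
print(reported_by_all(merged, ["BioGRID", "IntAct"]))
```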
Unless you’re a proteomicist yourself, you very likely want to avoid the raw mass spectral data; it’s gigabytes in size and requires special software to view. Instead, you’ll want to look at the processed data: simplified spectral representations that can be viewed in a Web browser yet contain sufficient information to verify a published identification or find novel modifications.
The steps involved in retrieving processed spectral data vary, but here’s how it works in gpmDB. Returning to ABL1, search for ENSP00000323315. From the list of the 20 best observations of this protein, select the first entry, which has an expectation score of –772.3 and a peptide coverage of 69.1%. The resulting page includes the protein’s full amino acid sequence, highlighting both exon boundaries (alternating black and blue) and identified peptides (in red).
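A coverage figure like that one can be understood as the share of the protein's residues that fall inside at least one identified peptide. The Python sketch below computes it under that assumption; gpmDB's exact calculation is not described here and may differ, and the sequence and peptides are toy examples.

```python
def sequence_coverage(protein, peptides):
    """Percentage of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        # Mark every occurrence of the peptide, including repeats.
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy sequence and peptides, purely illustrative.
toy_protein = "MKWVTFISLLRPAGKDE"
print(round(sequence_coverage(toy_protein, ["MK", "DE"]), 1))
```

Overlapping peptides are counted once per residue, so coverage can never exceed 100%.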
To view the corresponding spectra, scroll down the page to the section called “Identified Peptides.” Peptides are ordered by their position in the sequence (from N- to C-terminus); by selecting one you can view the associated mass spectrum, essentially a stylized histogram.
Peptides are sequenced in a mass spectrometer by controlled fragmentation along the protein backbone. In gpmDB, the spectrum is color coded to correlate each peak with the particular type of peptide-bond fragment it represents (that is, where exactly on the peptide backbone that particular fragmentation event occurred). Using these data you can independently assess the protein’s identification (by consulting with a proteomics expert, for instance) or probe its protein modifications, says Beavis. You can even re-analyze old data for newly discovered modifications; just select gpmDB’s “studio” link.
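To see where those peaks come from, consider the two fragment series most often discussed for this kind of experiment: b ions, which keep the N-terminal end of the peptide, and y ions, which keep the C-terminal end. The Python sketch below computes singly charged b and y ion m/z values for an unmodified peptide. That gpmDB's color coding corresponds to exactly these ion types is an assumption made for illustration, as is the toy peptide.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        # b ion: first i residues plus a proton.
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        # y ion: last i residues plus water plus a proton.
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return ions


# Toy peptide, purely illustrative.
for name, mz in sorted(fragment_ions("PEPTIDE").items()):
    print(f"{name}: {mz:.4f}")
```

A modification shows up as a consistent mass shift in every fragment that contains the modified residue, which is what makes re-analysis for new modifications possible.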
Really, says Aebersold, that may be the best reason for biologists to plumb such databases, “so someone can independently reconcile what the data mean.” Between new genome sequences, improved instrumentation, and newly discovered protein modifications, old proteomics datasets are likely riddled with gaps and inaccuracies, he says. “I would be absolutely certain that if someone were to reanalyze data from the year 2000, they might come up with new proteins than what the authors originally identified. It’s not that [the original authors] did something wrong, but the field has progressed.”