More efficient protein structure determination is a major goal of the US structural genomics projects. X-ray crystallography shines in its ability to determine structures quickly from data gathered at synchrotron beam lines, but only a fraction of proteins that can be produced in soluble form crystallize to yield X-ray structures. Many smaller proteins that fail in crystallization trials can yield solution structures by nuclear magnetic resonance (NMR), however.
NMR structures are fundamentally different from their X-ray-derived counterparts. For one thing, the proteins are in solution rather than locked in crystals; the resulting structures are thus more reflective of the molecules' natural (in vivo) state. Moreover, NMR provides something X-ray structures cannot: valuable information about protein dynamics, chemical properties, and ligand binding.
But NMR has problems, too. The process is computationally intensive, and bottlenecks in data collection and analysis mean structure determination can take weeks or even months. Now we and others are working to automate and accelerate the process via a suite of interactive software tools based on new ideas and mathematical algorithms.
Consider the proverbial problem of finding a needle in a haystack. Our plan is to isolate the "needle" based on its intrinsic properties: size, density, and iron content. First we pass the haystack through a sieve with holes just large enough for the needle to slip through; after this step, the needle is more probably with the material that passed through the sieve than with the material that did not. Next, we put the sieved material in water; now the needle is more probably at the bottom of the tank than with the floating material. Finally, to sharpen our probability of success further, we use a magnet to separate the needle from the rest of the nonfloating material.
Such an approach is called "probabilistic" – that is, errors and probable alternatives are quantified at each step along the way1 – and we believe it represents the best way to streamline protein NMR. Conventional NMR approaches, in contrast, can deal with probability in single steps but are much less systematic in carrying probable alternatives through multiple steps.
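The needle-and-haystack reasoning above can be sketched as sequential Bayesian updating. In this hypothetical illustration (the pass probabilities are invented for the example, not taken from the article), each filtering step is a noisy test that the needle passes with high probability and straw passes with lower probability, and we carry the posterior probability forward from step to step:

```python
import math

# Hypothetical sketch: treat each filtering step (sieve, water, magnet) as a
# noisy test. We track the posterior probability that an item that survived
# every test is the needle, applying Bayes' rule after each step.

def bayes_update(prior, p_pass_needle, p_pass_straw, passed):
    """Posterior P(needle | test outcome), given P(needle) = prior."""
    if passed:
        num = p_pass_needle * prior
        den = p_pass_needle * prior + p_pass_straw * (1 - prior)
    else:
        num = (1 - p_pass_needle) * prior
        den = (1 - p_pass_needle) * prior + (1 - p_pass_straw) * (1 - prior)
    return num / den

# Illustrative numbers (assumptions, not from the article):
prior = 1e-6                        # one needle among a million straws
steps = [(0.99, 0.10),              # sieve: needle almost always passes
         (0.95, 0.05),              # sinks in water
         (0.99, 0.01)]              # attracted to the magnet
for p_needle, p_straw in steps:
    prior = bayes_update(prior, p_needle, p_straw, passed=True)
    print(f"P(needle) after step: {prior:.3g}")
```

The point is not the particular numbers but the bookkeeping: every step quantifies how much it sharpened (or failed to sharpen) our belief, and that uncertainty is carried into the next step rather than discarded.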
Our idealized tools should also be iterative, allowing automated, continuous refinement and evolution of a burgeoning model in light of new data and analysis, and should enable optimal use of information from databases and experiments. The hope is that our toolset will ensure the quality of routine NMR structure determinations while reducing their time and cost.
SMART DATA COLLECTION
Anyone who's taken a basic organic chemistry class is familiar with simple one-dimensional (1D) NMR spectra: graphs of "chemical shift" versus intensity. Chemical shift is characteristic of a given nucleus' chemical environment; thus, the chemical shift of a proton in a CH3 group differs from that of a proton in a CH2Cl group. In protein NMR, we actually have three nuclei we can use – 1H, 13C, and 15N – and the patterns of chemical shifts from different residue types (e.g., histidine and isoleucine) are distinct. Yet, because proteins contain a repeating backbone structure and a relatively small variety of side chains, the spectra quickly become cluttered as resonances overlap.
We solve this problem by going to higher dimensionality in our experiments. A 2D spectrum, for instance, might correlate the chemical shifts of nuclei connected by one or more chemical bonds, or correlate chemical shifts of pairs of hydrogens that are very close to one another in space. Three-, four-, and even five-dimensional experiments are possible. But, as dimensionality increases, so too does experiment and processing time.
In the past few years others have shown that some information in higher-dimensional NMR spectra can be extracted from a reduced set of 2D spectra.2,3 This approach can lower data collection times significantly, thereby freeing the instrument to run more experiments. But without an optimal strategy to select the 2D spectra, such approaches can be inefficient and may require considerable processing time to identify peaks and their positions.
We have used elements of this approach to create an adaptive, interactive engine that identifies peaks in 3D NMR spectra in a probabilistic manner. Our method, called High-Resolution Iterative Frequency Identification (HIFI)-NMR,4 eliminates the spectral-reconstruction step and concentrates on finding the best model for the peak positions, choosing optimal 2D spectra on the fly. In this way, the algorithm can automatically create, in about two hours, a statistically annotated peak list that would take more than 20 hours to produce by conventional means.
Courtesy of John Markley
HIFI-NMR identifies the positions of peaks in a 3D NMR spectrum from a minimal series of 2D experiments. A preliminary 3D spectral model is generated from two spectra acquired at orthogonal angles (0° and 90°, shown at left), based on the assumption that 3D peaks appear at any point where 2D peaks (blue) intersect. To refine this model, the program automatically calculates the optimal angle for tilted 2D plane collection, acquires these peak positions (red), and recalculates the model. Additional tilted 2D plane data are collected, and the process stops when the program calculates that all information on peak positions has been extracted (green peaks). Three tilted planes are shown in this example.
How is this done? Imagine we want to identify the positions of small, motionless fish in an aquarium. From the front, we cannot see all of them; some are hidden behind others. So we start by taking a snapshot from the front of the aquarium and a second snapshot from one side. From this information, we calculate the angle between the two that would best resolve the remaining ambiguity and take a third snapshot at that angle. We then ask whether another snapshot, at another angle, would improve our detection of all the fish and their coordinates. If so, we repeat the last step; if not, we're done.
The HIFI software is tasked with identifying the chemical shifts of all the peaks that would appear in a 3D spectrum. After each new 2D spectrum is acquired at a different angle, the software reviews all the data and updates its model; it also judges when all possible peaks have been captured. In practice, we identify in this manner more than 96% of the peaks that can be found by earlier, more tedious methods.
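The projection-and-pruning idea can be sketched in a few lines. This toy example (my own illustration, not the published HIFI-NMR algorithm) shows how orthogonal 2D planes generate candidate 3D peaks, including spurious "ghost" intersections, and how a single tilted plane prunes the ghosts:

```python
import math

# Toy illustration of the projection idea behind HIFI-NMR (not the published
# algorithm): a 3D peak (x, y, z) appears in a 2D plane tilted by angle a
# at the position (x, y*cos(a) + z*sin(a)).

def project(peaks, angle_deg):
    a = math.radians(angle_deg)
    return {(round(x, 2), round(y * math.cos(a) + z * math.sin(a), 2))
            for x, y, z in peaks}

true_peaks = [(1.0, 2.0, 5.0), (1.0, 4.0, 3.0), (6.0, 1.0, 2.0)]

# Step 1: the 0-degree and 90-degree planes give (x, y) and (x, z) pairs.
xy = {(x, y) for x, y, z in true_peaks}
xz = {(x, z) for x, y, z in true_peaks}

# Step 2: naive candidates are all intersections with matching x columns;
# two of the five here are spurious ghosts, (1, 2, 3) and (1, 4, 5).
candidates = [(x, y, z) for (x, y) in xy for (x2, z) in xz if x == x2]

# Step 3: one tilted plane at 45 degrees prunes every candidate whose
# projection does not coincide with an observed peak, leaving the true ones.
observed_45 = project(true_peaks, 45)
refined = [p for p in candidates if project([p], 45) <= observed_45]
```

The real method must also choose each tilt angle optimally, cope with noise and peak overlap, and decide probabilistically when no further planes are worth collecting, but the core logic, intersect then prune, is the same.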
Our probabilistic approach to peak assignments is embodied in a software package called PISTACHIO (Probabilistic Identification of Spin sysTems and their Assignments including Coil-Helix Inference as Output).5 It uses as input the protein sequence and peak lists of the kind generated by HIFI-NMR or other data-collection methods; it outputs chemical-shift assignments represented as a series of minimum-energy configurations ranked according to their probable correctness, with multiple attributions in ambiguous cases.
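One small piece of what an assignment engine like PISTACHIO must do is score how well a spin system's observed chemical shifts match each residue type. The sketch below is a hedged illustration of that single subtask; the mean and standard-deviation values are rough round numbers I have inserted for the example, not curated database statistics:

```python
import math

# Illustrative residue-type scoring via Gaussian log-likelihoods of observed
# Ca/Cb chemical shifts. The statistics below are rough example values in ppm,
# not authoritative reference data.
STATS = {                 # residue: (Ca mean, Ca sd, Cb mean, Cb sd)
    "ALA": (53.1, 2.0, 19.0, 2.0),
    "GLY": (45.4, 1.3, None, None),   # glycine has no Cb
    "SER": (58.7, 2.1, 63.8, 1.8),
    "THR": (62.2, 2.6, 69.7, 1.6),
}

def log_likelihood(shift, mean, sd):
    return -0.5 * ((shift - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def rank_residue_types(ca, cb):
    """Rank residue types by the likelihood of the observed (Ca, Cb) shifts."""
    scores = {}
    for res, (ca_m, ca_s, cb_m, cb_s) in STATS.items():
        if (cb is None) != (cb_m is None):
            continue                  # Cb presence must match the residue type
        s = log_likelihood(ca, ca_m, ca_s)
        if cb is not None:
            s += log_likelihood(cb, cb_m, cb_s)
        scores[res] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = rank_residue_types(ca=61.9, cb=69.2)   # shifts typical of threonine
```

The full assignment problem is far harder, since competing hypotheses must be reconciled across the whole sequence, but ranking alternatives by likelihood rather than committing to one answer is the probabilistic style the article describes.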
With typical data sets, initial PISTACHIO runs yield reliable backbone assignments for more than 90% of the amino acid residues and reliable side-chain assignments with greater than 75% completeness. The assigned data are then run through the LACS (Linear Analysis of Chemical Shifts) software,6 which corrects possible referencing problems and identifies assignment outliers.
The software package called PECAN (Protein Energetic Conformational ANalysis)7 carries out probabilistic secondary structure determination, while another tool called ALMONDS (Arranging Local Manifolds of N-mers to Define Structure), currently in development, defines the protein's probabilistic torsion angle restraints. In the planning stage are faster and more automated ways of collecting and analyzing nuclear Overhauser effect data, which report "through-space" distances, to determine the protein's overall conformation.
The final step is to take all these data, along with information available from the current database of protein structures, to infer the most likely positions of the protein's atoms in 3D space. We know this approach works: Nilges and coworkers recently used a Markov-chain Monte Carlo simulation to demonstrate inferential structure determination from a more limited set of input parameters than those we hope to make available.1 And, by carrying forward the statistical information from all these steps, we will reveal what we do and do not know (that is, our confidence) about a given structure.
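The core of such inferential approaches is sampling an ensemble of structures from a posterior distribution rather than minimizing toward a single "best" structure. The minimal Metropolis sampler below illustrates the idea on a toy problem of my own devising, a single torsion angle under a hypothetical restraint; it is not the Nilges group's implementation:

```python
import math
import random

# Minimal Metropolis sampler illustrating inferential structure determination
# in miniature: draw an ensemble of torsion angles from a posterior instead of
# reporting one minimum-energy value. The toy "energy" stands in for agreement
# with hypothetical restraint data.

def energy(phi):
    # Hypothetical restraint centered near -60 degrees (an alpha-helical phi).
    return 0.5 * ((phi + 60.0) / 15.0) ** 2

def metropolis(n_steps, step=10.0, seed=0):
    rng = random.Random(seed)
    phi, samples = 0.0, []
    for _ in range(n_steps):
        trial = phi + rng.uniform(-step, step)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if rng.random() < math.exp(min(0.0, energy(phi) - energy(trial))):
            phi = trial
        samples.append(phi)
    return samples

samples = metropolis(20000)[5000:]        # discard burn-in
mean_phi = sum(samples) / len(samples)
```

The spread of the retained samples, not just their mean, is the output: it is exactly the "what we do and do not know" that carrying statistical information through every step makes explicit.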
Labeled proteins for NMR analysis can be prepared in a day or two by automated cell-free protein-production technology.8 By tying together the probabilistic steps for NMR data collection and analysis, we envision that, in five years, routine protein structure determinations with targets smaller than 20 kDa could be completed in two or three days following preparation of a double-labeled sample. Today our best record is two weeks, but most structures require more time.
Will automation take all the fun out of biomolecular NMR? I think not. NMR assignments and structures are really just first steps toward understanding the chemical and dynamic properties of a protein that underlie its function. By making them faster and more objective, we will be free to concentrate on many more interesting issues – for instance, protein biology itself.
John L. Markley is the Steenbock Professor of Biomolecular Structure in the biochemistry department at the University of Wisconsin, Madison. He directs the Center for Eukaryotic Structural Genomics and is head of both the National Magnetic Resonance Facility at Madison and the BioMagResBank. Markley's lab focuses on protein sequence structure-function relationships and the technology of biomolecular NMR spectroscopy. The tools described in this article are available at