Five years have passed since the National Institutes of Health launched the Protein Structure Initiative (PSI), a 10-year, $600 million effort to accelerate structural genomics. The project's pilot phase emphasized technical development and ended earlier this year; the second (production) phase is now underway. With $270 million spent and 1,200 structures solved, it is instructive to reflect on what we have learned.
The mission of the PSI is to make the three-dimensional, atomic-level structures of most proteins easily available from knowledge of their amino acid sequences. At the moment this is impossible: we simply don't know what many protein folds look like. But by systematically determining the structure of an exemplar of each sequence family, we hope to give researchers the resources to overcome this deficit. An added benefit of this effort is the development of tools to accelerate experimental structure determination, whether by X-ray crystallography or nuclear magnetic resonance (NMR).
The Initiative's pilot phase funded nine research centers with the tripartite goals of demonstrating the feasibility of the high-throughput approach, solving a significant number of non-redundant protein structures, and preparing for a subsequent production phase. To date, the PSI-1 centers have solved more than 1,200 structures – that's about 5% of all the structures deposited into the PDB during the period 2000–2004. (The PDB currently holds just under 33,000 protein structures.) About 65% of the PSI structures are non-redundant and thus are appropriate representatives for their sequence families, which have over 100 members each, on average.
The determination of so many protein structures would not be feasible without major improvements in the methods and instruments required for the production and structural determination of proteins (see sidebar). Impressive technology developments have come from all nine PSI pilot centers and other PSI research grants, covering every step of the pipeline – from the initial target selection step to the final post-structural analysis of family and functional relationships.
Overall, the pilot phase demonstrated that structural genomics pipelines can be constructed and scaled up and that high-throughput operation works for many proteins. The methods and instruments developed have permitted a substantial reduction in the average cost of determining a structure – down about four-fold from the initial year (which was marked by substantial infrastructure investment) to less than $150,000 in total costs per structure.
Though they were designed for structural genomics pipelines, we believe these technologies will have a significant impact on biomedical research far beyond the Initiative. Many have been adapted by commercial vendors and are being incorporated into biological research labs.
The pilot centers also demonstrated that attacking several targets within the same family often yields at least one structure for the family, thus increasing the success rate and structural coverage. On the other hand, bottlenecks remain for some proteins, especially membrane proteins, and better homology modeling methods are badly needed.
PSI-2, the Initiative's "production" phase, was officially launched in July with a five-year, $300 million budget supporting 10 facilities: four large-scale research centers and six specialized research centers. The large-scale centers will work collectively in a high-throughput operation to determine about three thousand non-redundant structures, bringing about structural coverage of most of the sequenced genes. They will also continue the development of technologies to lower costs and increase success rates, while at the same time pursuing a biomedical theme project that utilizes a structural genomics approach. Selected themes include proteins that are conserved in all kingdoms of life, proteins from pathogens and higher organisms, networks of cofunctioning proteins involved in developmental biology and cancer, and protein phosphatases and multidomain eukaryotic proteins.
The six smaller, specialized centers will focus on the development of innovative methodology and technology for one or more classes of challenging proteins, including membrane proteins; proteins from higher eukaryotes, especially human; and small protein complexes. The centers also will try to overcome other major bottlenecks to high-throughput operation, especially expression and crystallization.
Two other components of PSI-2 planned for 2006 are a Knowledge Base, which will provide links to analysis tools and functional annotations, and a Materials Repository for storing expression clones produced by the research centers.
BOLD GOALS FOR STRUCTURAL GENOMICS
The PSI has set bold goals for the future of structural genomics. In its first phase, the national program catalyzed the development of technology essential for the field, leading to more efficient structural genomics pipelines. Analysis of the structures generated so far by the PSI shows that they are substantially more diverse, on average, than structures determined through traditional structural biology studies, which often focus on the determination of sets of closely related structures to probe underlying mechanisms. The tools, approaches, and structural data that have resulted from the PSI no doubt will lead to advances in related fields.
As the PSI continues into the second phase, its goals will become more targeted. Stronger emphasis will be placed on prioritizing targets based on structural coverage and biomedical significance. There are many large protein families of potentially great biomedical importance that do not have any known structural representatives; by sharing this information with the structural biology community, the PSI will give researchers access to data that not only helps answer old questions but also helps generate new ones. A cooperative agreement mechanism will ensure that the PSI centers focus on targets of interest to the broader community.
The Initiative's ultimate goal, and where its impact on structural biology and biomedical research in general will be the greatest, is the development of new approaches for accurately modeling unknown protein structures based on sequence comparisons. This ability would dramatically cut the time and costs associated with protein structure determination. But the value of such predictions depends on both the quality of the modeling tools and the questions being asked. In some cases, a general model of the protein's overall shape can provide powerful biomedical insights; in other cases, such as the development of new drugs, much more precise modeling is required.
Through the support of research directed toward improving modeling tools, we hope to maximize the impact of the technical successes of the PSI and extend them, through various outreach activities, to the entire biomedical community.
Selected Key Technologies
Several centers have developed technologies to simplify and enhance protein expression. The Center for Eukaryotic Structural Genomics (CESG) at the University of Wisconsin, Madison, for instance, has developed and refined a wheat germ cell-free protein production system (in collaboration with the system's inventor, Yaeta Endo of Ehime University, Japan). In a head-to-head comparison with a bacterial protein expression system, CESG researchers found that the wheat germ system produced nearly twice as many proteins suitable for NMR as did the bacterial system.
C.D. Lima, at Sloan-Kettering, part of the New York-Structural GenomiX Research Center (NYSGXRC), developed another protein-expression system. The system (commercialized by Invitrogen of Carlsbad, Calif., as the Champion pET SUMO Expression System) uses topoisomerase to mediate rapid ligation of inserts into vectors, and a SUMO moiety to enhance solubility. Following expression, a protease is used to liberate the desired protein into solution.
F.W. Studier at the Brookhaven National Laboratory (part of the NYSGXRC) has developed auto-induction media (available from Novagen of Madison, Wisc., as the Overnight Express Auto Induction Systems) to simplify protein expression. Instead of having to monitor cultures and add IPTG at precisely the proper density, auto-induction systems activate protein expression based on cell density.
Several PSI centers have made advancements in setting up and monitoring crystallography trials. The Agincourt robotic system, for instance, developed at the Genomics Institute of the Novartis Research Foundation, was utilized by the Joint Center for Structural Genomics at the Scripps Research Institute. It can set up 2,880 sitting-drop crystallization experiments per hour. Those plates are then transferred to a combined incubator and monitoring system capable of capturing 138,000 images per day at both 4°C and 20°C. Researchers at the Structural Genomics of Pathogenic Protozoa center at the University of Washington have developed a robotic system (called ACAPELLA) for protein crystal growth by free interface diffusion in plastic capillaries.
Geoffrey Waldo and Tom Terwilliger at the TB Structural Genomics Consortium have developed methods that utilize reporter proteins to indicate a protein's solubility and then engineer increased folding and solubility. This center also works with James Berger at the University of California, Berkeley, and Steven Quake at the California Institute of Technology, who developed another crystallization trial system based on a microfluidics chip. (Fluidigm of South San Francisco has commercialized the system.) Further developments under a PSI small business research project grant will permit direct X-ray analysis of the crystals on chips.
PSI researchers have also developed algorithms to predict protein domains from sequence, to automate structure determination from crystallographic data, to assign peaks in NMR spectra, and to automate the identification of structurally important motifs in three-dimensional structures. CESG researchers built a laboratory information management system to organize the center's workflows and resulting data, while a team at the Midwest Center for Structural Genomics has built a semi-automated system for producing reagents (and creating an audit trail) based on a wireless network and a personal digital assistant (PDA).
For a complete list of PSI-1 technologies, see
John Norvell (left) is director of the Protein Structure Initiative at the National Institute of General Medical Sciences (NIGMS), part of the National Institutes of Health. Jeremy M. Berg (right) is director of the NIGMS. Both Norvell and Berg have scientific backgrounds in structural biology.
John Norvell can be contacted at