Peer reviewers for the National Institutes of Health are faced with the impossible. They are asked to evaluate applications that are too complex and too long in an amount of time that is too short. The process requires them to provide a score with a level of precision that is not defensible, and calls for consensus about topics that usually do not allow for definitive positions. It is not surprising, therefore, that the system does not work well in identifying innovative projects. It does, however, work effectively in funding established investigators proposing reasonable, if obvious, projects.
Most NIH grant applications run 15 to 25 pages and contain highly technical, arcane descriptions of a proposed investigation in small, single-spaced type. A typical reviewer receives five to 10 of these proposals for each of three meetings per year and is expected, within a month, to prepare a detailed written report on each one. Reviewers look at the remaining applications in their study section cursorily, if at all, yet every reviewer votes on every application.
At study-section meetings, disagreement among reviewers is unwelcome. Uniformity becomes a necessity, because most members have not read the application and are looking for definitive guidance on how to cast their required votes. This process strongly selects against controversial ideas, heretical proposals, and unusual conceptualizations, precisely the submissions likely to engender disagreement.
In scoring applications, NIH uses a scale with 41 grades: scores run from 1.0 (best) to 5.0 in increments of 0.1. It is difficult to imagine that scoring on such a fine-grained scale is precise. Yet despite this unrealistic level of complexity, the scale is still not adequate. The scores are then averaged and multiplied by 100, yielding priority scores on a scale with 401 grades, so that NIH can make funding decisions across a pool of more than 40,000 applications.
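The arithmetic behind the 41 and 401 figures can be made concrete. The sketch below assumes the historical NIH convention of grades from 1.0 (best) to 5.0 in steps of 0.1, with averages multiplied by 100 to form priority scores; the function name is illustrative, not NIH's.

```python
# The 41-grade reviewer scale: 1.0 (best) through 5.0 in 0.1 steps
# (an assumption based on the historical NIH system).
raw_grades = [round(1.0 + 0.1 * i, 1) for i in range(41)]
assert len(raw_grades) == 41

def priority_score(scores):
    """Average the reviewers' grades and scale by 100."""
    return round(100 * sum(scores) / len(scores))

# Priority scores span 100 to 500: a 401-grade scale.
assert priority_score([1.0, 1.0]) == 100
assert priority_score([5.0, 5.0]) == 500
print(priority_score([2.1, 2.3, 1.9]))  # 210
```

The point of the sketch is the mismatch: three reviewers choosing among 41 grades are treated as jointly resolving 401 distinct levels of quality.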
No one person or procedure is to blame for this quagmire. The current process was born of a logical method that had to grow incrementally with an expanding NIH bureaucracy. But attempts to fix it have failed. In 1996, NIH director Harold Varmus established the Peer Review Oversight Group (PROG). He said that NIH needed to "focus on projects that will lead to changes in how we think about science and that will encourage investigators to take more risks." Wendy Baldwin, an NIH administrator, stated in a PROG meeting that "NIH is strongly in favor of eliciting, identifying, and funding creative and novel research projects." In her view, this goal would require "a complex intervention." PROG members considered how grants were scored, how study sections were constituted, how the critiques were written, and how applicants could redress their grievances by rebuttal. They recognized the undue burden on reviewers. They discussed the scale for scoring the thousands of applications received yearly. Their response: ask reviewers to rate the degree of innovation of each proposed investigation.
This facile intervention has failed. Current NIH director Elias Zerhouni remains concerned about the agency's ability to identify "creative, unexplored avenues of research." His NIH Roadmap is a plan constituted in part to address this deficiency, but it does not include a reform of the peer-review system.
A fundamental overhaul is necessary. I propose that when the NIH Peer Review Advisory Committee meets later this month, it consider shortening the research plan to two to four pages, so that 20 to 30 reviews could be solicited for each application. Reviewers would provide a score on a five-grade scale without writing a report. Committee meetings, which tend to reinforce herd behavior, would be eliminated. The mean and variance of the scores would be calculated and used to determine funding. Applications with high mean scores and low variance represent excellent projects directed toward obvious aims; applications with intermediate means and high variance identify controversial projects that are more likely to be innovative. Finally, applications would be grouped by discipline and by the direct funds available to the principal investigator over the two years before submission.
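The proposed scheme reduces to a mean-and-variance calculation over a pool of short reviews. A minimal sketch, in which the score lists stand in for 20 to 30 hypothetical independent reviews (the numbers are illustrative assumptions, not data):

```python
import statistics

def summarize(scores):
    """Mean and sample variance of reviewer grades on a 1-5 scale."""
    return statistics.mean(scores), statistics.variance(scores)

# Hypothetical review panels for two applications.
consensus = [1, 1, 2, 1, 1, 2, 1, 1]  # reviewers largely agree
split     = [1, 5, 1, 4, 2, 5, 1, 4]  # reviewers are divided

m1, v1 = summarize(consensus)
m2, v2 = summarize(split)

# Low variance flags a solid but obvious project; an intermediate
# mean with high variance flags a controversial, possibly
# innovative one.
print(f"consensus: mean={m1:.2f} variance={v1:.2f}")
print(f"split:     mean={m2:.2f} variance={v2:.2f}")
```

With many independent scores and no meeting, the variance survives as a signal rather than being argued away in the search for consensus.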
Adopting this scheme will have several consequences. The shorter application and the elimination of meetings and extensive written critiques will significantly reduce the essentially nonproductive effort expended in review. A simplified scoring scale will be more defensible, because it asks reviewers for only as many gradations of quality as they can realistically distinguish. Increasing the number of reviews will provide a more accurate measure of the quality of the proposals, and the variance in these assessments will suggest how innovative the proposed studies are. Finally, grouping applications by research money available will eliminate the competition between veterans and rookies. In this way, younger investigators will not be at such a disadvantage, and more seasoned scientists will not be as prone to coast on old ideas.
Peer review at NIH is due for significant change, not cosmetic adjustment. The structural changes proposed here would address the most egregious deficiencies of the current system.
David Kaplan