The human genome consists of three billion base pairs that encode some 100,000 to 300,000 genes. Could we work out the entire sequence of this DNA? What use would that information be? The sequence alone would not tell us, today, what the genes are and how they function.
A major goal of biology is to solve the structure-function problem: to be able to predict from the DNA sequence what the structure of a protein might be and, ultimately, how it might function. The solution to this problem, which may still take several decades, would mean that we could interpret DNA sequences directly, without further experimentation. Knowing the human sequence by itself doesn't solve this problem, but the information would serve as a massive database for tests of solutions.
A human genome project begins with the construction of a physical map: an ordered collection of 40,000 base pair-long DNA segments (cosmids), each carried in a different bacterial strain, that covers the entire genome. Some 100,000 such cosmids would be needed. Then scientists could establish the sequence of each of these DNA segments of the physical map, and thus the complete sequence of the human genome.
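The cosmid count can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the figures quoted above; the function names, and the definition of coverage as total cloned bases divided by genome length, are illustrative assumptions rather than anything specified in the text.

```python
GENOME_BP = 3_000_000_000  # genome size quoted in the text
COSMID_BP = 40_000         # cosmid length quoted in the text


def min_cosmids(genome_bp: int = GENOME_BP, cosmid_bp: int = COSMID_BP) -> int:
    """Fewest cosmids that could tile the genome end to end with no overlap."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-genome_bp // cosmid_bp)


def coverage(n_cosmids: int, genome_bp: int = GENOME_BP,
             cosmid_bp: int = COSMID_BP) -> float:
    """Total cloned bases divided by genome length (assumed definition)."""
    return n_cosmids * cosmid_bp / genome_bp


if __name__ == "__main__":
    print("minimum cosmids, no overlap:", min_cosmids())
    print("coverage with 100,000 cosmids:", round(coverage(100_000), 2))
```

Under these assumptions, 75,000 cosmids is the bare minimum, so a collection of 100,000 holds about a third more DNA than the genome itself, which leaves room for the overlaps needed to put the segments in order.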
How might this be done? The physical map of cosmids can be assembled by cloning and fingerprinting techniques in a period of two to three years by a group of approximately 50 technicians. This would simultaneously establish a restriction enzyme map of each chromosome and the entire genome.
Complete sequencing, however, requires a series of improvements in technology. I expect that within a few years, our technology will be able to sequence one megabase per technician-year. At that rate 100 technicians could sequence the genome in 30 years. An effort to improve the technology over a 10-year period should raise the rate by a factor of 10. As the physical map develops, we should use current technology to sequence regions of highest interest, such as ones with newly identified genes. We could then sequence entire smaller chromosomes, moving to sequence the bulk of DNA later.
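The time estimate follows directly from the quoted rates. The sketch below, in Python, reproduces it; the function name and parameter names are hypothetical, and the calculation assumes each base is sequenced once, with no allowance for repeat reads or error checking.

```python
def years_to_sequence(genome_bp: int, technicians: int,
                      bp_per_technician_year: int) -> float:
    """Years needed if every technician works at a constant rate.

    Assumes each base is read exactly once (a simplification).
    """
    return genome_bp / (technicians * bp_per_technician_year)


GENOME_BP = 3_000_000_000   # genome size quoted in the text
RATE_NOW = 1_000_000        # one megabase per technician-year
RATE_IMPROVED = 10 * RATE_NOW  # the hoped-for tenfold improvement

if __name__ == "__main__":
    print(years_to_sequence(GENOME_BP, 100, RATE_NOW))
    print(years_to_sequence(GENOME_BP, 100, RATE_IMPROVED))
```

With the same 100 technicians, the tenfold improvement would cut the estimate from 30 years to three.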
What are the benefits? One goal of biology is to understand the implications of the human genetic sequence. In a way we cannot yet translate, that sequence contains all the information needed to specify a human being. Ultimately, we want to know how genes direct the formation of organs, assemble a body, maintain it, and damage it if they go wrong. The human genome sequence is a tool to be used in this study. Having both the physical map and the complete sequence would speed up critical aspects of our attempt to learn how genes function.
Today when scientists identify the protein product of a gene of biological or medical interest, they spend years, first to find and to clone the corresponding DNA fragments and then to work out the fragments' structure by sequencing, before they can turn to the interesting biology of what that gene does. If a new gene is identified solely by genetics, by recognizing from an inheritance pattern that some gene controls a human disease or human function, the process is even longer. First we may struggle to locate that gene on the DNA by finding ever-closer genetic markers, and then we must go through DNA isolation and sequencing before turning to the biological problem.
Knowing the ordered physical map and the sequence would simplify or eliminate all these intermediate steps. The physical map itself would speed research by cutting six months or a year of effort out of every isolation of a human gene. The full sequence would mean that at the moment scientists isolate a new protein and work out a bit of its amino acid sequence, they could identify in a database a portion of its gene and the region of the genome from which it came. This leads directly to isolation and full characterization of the gene for the protein.
Genetic mapping will be used in medicine not just to identify the genes that correspond to obvious genetic diseases, which are rare, but to identify gene clusters that influence common diseases. Given a set of markers, genetic factors in major killers, such as heart disease, could be localized. To do this, scientists need to be able to correlate the physical map with a genetic map based on restriction fragment-length polymorphisms.
A genetic map of some 300 markers would permit us to localize new genes to within 10 centimorgans, or about 10 million base pairs. A fine genetic map of some 3,000 markers would permit localization to within one million base pairs. Such an area would contain 10 to 100 genes. The information in the human genome sequence provides many ways to identify which sections in such a million-base-pair-long region are genes, and which gene might be relevant.
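The resolution figures can be checked the same way. The sketch below, in Python, assumes the markers are spaced evenly along the three-billion-base-pair genome and takes 10 centimorgans as about ten million base pairs, that is, roughly one million base pairs per centimorgan; both are simplifications, and the function names are hypothetical.

```python
GENOME_BP = 3_000_000_000       # genome size quoted in the text
BP_PER_CENTIMORGAN = 1_000_000  # assumed: 10 centimorgans ~ 10 million base pairs


def marker_spacing_bp(n_markers: int, genome_bp: int = GENOME_BP) -> float:
    """Average distance between markers, assuming even spacing."""
    return genome_bp / n_markers


def marker_spacing_cm(n_markers: int) -> float:
    """The same spacing expressed in centimorgans under the assumed conversion."""
    return marker_spacing_bp(n_markers) / BP_PER_CENTIMORGAN


if __name__ == "__main__":
    for n in (300, 3_000):
        print(n, "markers:", int(marker_spacing_bp(n)), "bp,",
              marker_spacing_cm(n), "cM")
```

Real markers are not evenly spaced, and the ratio of genetic to physical distance varies along the chromosomes, so these numbers are averages rather than guarantees for any one region.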
A human reference sequence, created once for all through a highly focused effort, would thus speed all of the work in human biology. The human gene sequence is a tool to use as we work toward complete understanding of human genes. Ultimately, we will be able to read out the genes directly.
A 1980 Nobel laureate in chemistry, Gilbert is the H.H. Timken Professor of Science at Harvard University, Cambridge, MA 02138.
Our knowledge of the human gene complement, its content and organization, is being revolutionized by molecular genetics. It is now within our grasp to characterize the 10,000 or so basic genetic functions, to obtain a complete map of the genes along the 23 pairs of human chromosomes, and eventually a complete sequence of human DNA.
The genetic map is the equivalent of a dictionary and provides a unique ordering of the genes. The map is fundamental for applications of molecular genetics to understanding the functional basis of essentially all the major chronic diseases, including cancer and mental, heart and autoimmune diseases. Gene mapping has been the key, for example, to identifying the role of oncogenes in leukemias and other cancers, to elucidating the genetic (and, eventually, functional) basis for the inherited eye tumor retinoblastoma, and to establishing the molecular basis for color-blindness. The identification of markers for the inherited diseases cystic fibrosis, Huntington's, and Duchenne muscular dystrophy has made possible prenatal diagnosis of these diseases, and provides the basis for establishing their molecular defects.
Recombinant DNA techniques have made available an essentially unlimited variety of genetic markers at the DNA level. These can be used to locate genes controlling susceptibility to complex diseases or to normal variation (for example, in facial features or behavior). Given a knowledge of the complete human gene sequence, there is no limit to the possibilities for analyzing and understanding all aspects of human genetic variation, which underlie understanding of essentially all the major human chronic diseases, as well as normal variation between people.
Establishing the total human genome sequence is surely a major challenge that must be undertaken worldwide. It is probably the most important project of its kind that is now within our grasp, in terms of its potential contribution to human welfare. It is achievable and enormously worthwhile, has no defense implications and generates no case for competition between laboratories or nations.
The first step in establishing the human genome sequence would be to analyze its overall organization by obtaining a complete set of overlapping clones. This provides a framework for the genome dictionary, along which the regions coding for the different protein products could be identified long before the complete DNA sequence has been established.
At each step along the way to the complete sequence, scientists will obtain valuable information that can be applied to the problems of studying normal and abnormal human variation. The project as a whole would stimulate research in a variety of important areas. The project will include development of techniques for obtaining overlapping DNA clones and for automating sequencing, the study of specific sequences of interest, applications to the analysis of particular human variations and comparative studies between species, computer analysis and development of approaches for handling large genetic databases, and the ultimate challenge of predicting structure (and so function) from the DNA and protein sequence.
Quite apart from its application to the analysis of human variation, the project will provide information of enormous interest for unravelling the evolutionary relationships and hierarchies between gene products within and between species, and will reveal the control language for complex patterns of differential gene expression during development and differentiation. The project would support much good basic research. As a whole, in fact, it subsumes much of what is now done in molecular genetics.
Many estimates have been made of the total cost of obtaining the human gene sequence. These range up to a total of about $2 billion, spread, of course, over a number of years. This cost is not high compared to many defense projects, to putting a man on Mars, or to building a new generation of super-accelerators. Some of this money is already going to support existing molecular genetics projects.
The major challenge is to coordinate activities of scientists working in this field worldwide. Coordination will ensure there is a maximum return for effort, and a maximum exchange of information and materials directed toward the ultimate goal of establishing and understanding the whole human gene sequence. The scientists and the public must be convinced of the enormous value of this project. Once they are, its cost should no longer be a major issue. Spread worldwide and in relation to the potential benefits, especially in comparison with other major expenditures, the cost is minimal, and the stimulus of extra support to this area of research is of immense value.
Knowledge of the total human genome sequence has profound implications, not only for the analysis, prevention and treatment of disease, but also for better understanding of normal variation, which is important for dealing with the wider problems of our present-day society. It is my hope that there will be a worldwide consensus to pursue this most worthwhile, challenging, but achievable goal, and that the scientists will take the lead in its coordination. The end of the century is a realistic target. We should call it Project 2000.
Sir Walter directs research at the Imperial Cancer Research Fund Laboratories, London, WC2A 3PX.