The Human Genome Project sequenced “the human genome” and is widely credited with setting in motion the most exciting era of fundamental new scientific discovery since Galileo. That’s remarkable, because in important ways “the human genome” that we have labeled as such doesn't actually exist.
Plato essentially asserted that things like chairs and dogs, which we observe in this physical world, and even concepts like virtues, are but imperfect representations or instances of some ideal that exists, but not in the material world. Such a Platonic ideal is “the human genome,” a sequence of about 3 billion nucleotides arrayed across a linear scale of position from the start of chromosome 1 to the end of the sex chromosomes. Whether it was obtained from one person or several has so far been shrouded in secrecy for bioethical reasons, but it makes no real difference. What we call the human genome sequence...
Nor is the human genome we have a “normal” genome. What would it mean to be “normal” for the nucleotide at position 1,234,547 on chromosome 11? All we know is that the donor(s) had no identified disease when bled for the cause, but sooner or later some disease will arise. Essentially all available whole genome sequences show potentially disease-producing variants, even including nonfunctional genes, in donors who were unaffected at the time.
Furthermore, the current reference genome sequence is haploid, which means that even if it were compiled from just one donor, the single reference sequence does not report the variation that we know exists at millions of nucleotide positions between the donor’s two copies of each chromosome (except for X and Y). I understand that the DNA template is being resequenced, to be reported as a diploid sequence, which is progress. Hopefully this will be done in a way that produces phased sequence, in which each chromosome is reported separately, rather than just identifying the two alleles at each variable site along the genome without specifying on which chromosome each lies. Only the former format will represent sequences as they actually exist in the sequenced person, identifying which alleles go together on a chromosome, and are thus linked evolutionarily.
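To make the difference between phased and unphased reporting concrete, here is a minimal Python sketch. The site names, alleles, and the function compatible_haplotype_pairs are hypothetical, invented only for illustration. It shows that an unphased report of just two variable sites is already consistent with more than one pair of actual chromosomes.

```python
from itertools import product

# Hypothetical toy data: two variable sites in one diploid donor.
# An unphased report lists the two alleles seen at each site, with no
# record of which chromosome copy carries which allele.
unphased = {"site1": frozenset("AG"), "site2": frozenset("CT")}


def compatible_haplotype_pairs(genotypes):
    """Return every pair of haplotypes consistent with unphased genotypes."""
    options = []
    for site in sorted(genotypes):
        alleles = sorted(genotypes[site])
        if len(alleles) == 1:
            # Homozygous site: both chromosome copies carry the same allele.
            options.append([(alleles[0], alleles[0])])
        else:
            # Heterozygous site: either allele could sit on either copy.
            options.append([(alleles[0], alleles[1]),
                            (alleles[1], alleles[0])])
    pairs = set()
    for choice in product(*options):
        copy1 = "".join(a for a, _ in choice)
        copy2 = "".join(b for _, b in choice)
        # The two copies are unordered, so store each pair as a frozenset.
        pairs.add(frozenset([copy1, copy2]))
    return pairs


pairs = compatible_haplotype_pairs(unphased)
for pair in sorted(sorted(p) for p in pairs):
    print(" / ".join(pair))
```

With these toy genotypes the donor could carry either AC and GT, or AT and GC. Only phased sequence says which, and the number of possibilities doubles with each additional heterozygous site.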
Even so, the reference human genome will keep changing! Corrections and refinements of problematic regions that are technically difficult to sequence are made, though nobody claims it will ever identify 100 percent of the 3.1 billion nucleotides without mistakes. But forgetting such minor errors, if such a diploid sequence were obtained from a single person, rather than a composite of several, one might think we finally have an actual set of sequences rather than a non-existent Platonic ideal. That would then be like the authorized type specimens of real plants and animals in museums.
Of course, biologists realize that it’s only a reference sequence, and they think of each of us carrying “copies” of the human genome referent, with some variants of that sequence. But even that idea is wrong. Calling them copies would be Platonic, as if our individual sequences came directly, if imperfectly, from this ideal as their shared template. More accurately, we should use a term like “instance” rather than copy. But a fundamental point is that the resemblance among instances is not due to descent from a single ideal, but for the evolutionary reason that they are homologous, that is, are from a chain of descent from the gene’s common ancestor. Homology is not the manifestation of an ideal, because the original ancestral instance really did exist.
Biologists take advantage of this fundamental fact of life when inferring ancestral sequences from the observed variation in today’s populations. One might suggest that instead of a rather arbitrary reference sequence from some donor, “the human genome” sequence should be this inferred ancestral sequence. But that doesn’t work either. The ancestral sequence for human genes usually goes back far beyond the origin of humans, and the ancestral sequences for each gene will have existed in vastly different times, places, individuals, and species. The intervening noncoding sequences between such genes, which are generally less constrained by natural selection, vary so much that we often can’t really guess their ancestral state. Further, genes have been rearranged among the chromosomes over time, so that genes A and B that are chromosomal neighbors in human genomes today may have been on entirely different chromosomes in the past, or vice versa. Finally, the ancestral gene may have been so different from today's that using it as our reference would not serve the biomedical research community from a functional point of view.
The same is true to a lesser extent even among modern human genomes: in addition to single nucleotide variation, millions of bits of DNA large and small have been deleted, inserted, inverted, or rearranged in every human genome instance. This variation, and the variation that will continue to accrue in the human population, distances us from any single reference sequence even further.
Reference sets?
If a single reference sequence, even the ancestral sequence that really did exist, is problematic, could there be a better way? An appealing possibility is to use a set of DNA sequences, perhaps all known instances, to characterize human genomes. Instead of a single string, suppose we represented each part in the format of what is known as a gene or sequence "logo." Here is an example:
A logo of this kind shows the relative frequency of each nucleotide at every point along the sequence. One would have to add a way to visualize insertions and deletions and so on, but computer technologies should be up to that task. If “The Human Genomes” sequence were portrayed in this way, we might replace our arbitrary type-specimen with more natural, biologically accurate, population thinking. Efforts are under way to create a biological reference along these lines.
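The numbers behind such a logo are simple to compute. The following is a minimal Python sketch, assuming the sequences have already been aligned to equal length. The four toy sequences and the function name position_frequencies are invented for illustration and are not drawn from any real data set.

```python
from collections import Counter


def position_frequencies(aligned):
    """Relative frequency of each symbol at each aligned position."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    table = []
    # zip(*aligned) walks the alignment column by column.
    for column in zip(*aligned):
        counts = Counter(column)
        table.append({base: n / len(aligned) for base, n in counts.items()})
    return table


# Hypothetical instances of the same short stretch of sequence.
instances = ["ACGT", "ACGA", "ATGT", "ACGT"]
freqs = position_frequencies(instances)
for position, column in enumerate(freqs, start=1):
    print(position, dict(sorted(column.items())))
```

In this toy set, positions 1 and 3 are invariant, while positions 2 and 4 each carry a minority nucleotide at a frequency of 0.25. A deletion could be recorded by including a gap symbol such as "-" in the aligned strings, though larger insertions and rearrangements would need a richer representation than a fixed set of columns.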
Of course, a reference like this would have to be constantly updated, and still could not keep up with the changing frequencies at each position as people die and babies are born all the time. But there’s a more important and even deeper problem—with Platonic implications. Every time an individual cell divides, new mutations arise; no two cells even within any individual have the identical sequence. Because of this somatic mutation, the single sequence obtained from each individual is an imperfect representation even of that person’s genome. We can never know the variants in each of his/her billions of cells.
Coming to terms with Plato
We routinely use an arbitrary reference and/or ancestral sequences in our daily research. We develop phylogenies, and identify variation responsible for traits, including disease. We comparatively consult arbitrary references for humans and mice to design experiments that work only because of our evolutionary relationship. As limited human beings, we cannot grasp everything in our heads, and representations and reference guidelines are immensely useful.
In fact, in many ways, the human reference genome is an ideal, but not in the way Plato had envisioned ideals. In a deep and interesting way, he had things backward. His idea was that we are only able to see imperfect images of ideals that really have some separate existence. But actually, the ideals are neural constructs built inside our very material heads, and it is they that are imperfect representations of the actual world, not the other way round as Plato had it.
Thus, while any human reference genome may be far from perfect, it’s what we have to work with today, and it helps shed light on all aspects of human biology. Representations are fundamental to science. The danger is if we don't understand them and they become misrepresentations.
Ken Weiss is a geneticist and evolutionary biologist at Penn State University. A fuller discussion of these points is available at The Mermaid’s Tale, a blog to which Weiss is a contributor.