To appreciate a natural wonder such as a mountain range or a canal system on Mars, an observer must stand back. So it is with the human genome. As annotation progresses, some researchers are stepping back to better see patterns within the sequence. In addition to offering clues about humanity's biology and origins, the research generates positive feedback where discoveries fueled by the sequence enable researchers to refine annotation.
The human genome sequence is providing a broader, aerial view of some classic features: deletions and duplications, pseudogenes, and the forces that have melded groups of linked gene variants into blocks of shared variations known as haplotypes. While patterns of deletion and duplication mirror deep time as the primate family tree branched, pseudogenes and haplotypes illuminate our more recent heritage. And all three hallmarks of the genome may underlie or explain certain medical conditions. As on the red planet, exploration...
DELETIONS AND DUPLICATIONS
In the pre-genome era, known mutations defined the ends of a spectrum, from single gene disorders to rearrangements large enough to alter chromosomal banding. Genome sequence comparisons reveal deletions and duplications that fall between these extremes.
Deletions are associated with many syndromes that include mental retardation. Joris Veltman of University Medical Center Nijmegen, The Netherlands, and colleagues used array-based comparative genomic hybridization (CGH) to probe 3,500 bacterial artificial chromosome (BAC) clones that span the genome, finding small deletions or duplications in seven of 20 individuals with mental retardation but normal karyotypes.1 These microdeletions and microduplications were beneath the radar of standard cytogenetics.
When invented in 1992, CGH scanned metaphase chromosomes, which are too condensed to yield much information. The new and improved version uses arrays of BACs that sample the genome in readable pieces. "A well established human genome sequence with an integrated clone map is essential for array-based CGH, since the technology is strongly dependent on the availability of well characterized genomic clones," says Veltman. And, he adds, array CGH can highlight some sequencing errors.
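As a rough illustration of how an array CGH readout becomes a list of candidate deletions and duplications, the sketch below compares test and reference signal for each clone as a log2 ratio. It is not the Nijmegen group's analysis: the clone names, intensities, and the cutoffs of -0.3 and 0.3 are hypothetical values chosen for the example.

```python
import math

def call_copy_number(clones, loss_cutoff=-0.3, gain_cutoff=0.3):
    """Classify each clone as 'loss', 'gain', or 'normal'.

    `clones` maps a clone name to (test_intensity, reference_intensity).
    The cutoffs are illustrative assumptions, not published thresholds.
    """
    calls = {}
    for name, (test, reference) in clones.items():
        # Equal copy number in both samples gives a log2 ratio near 0.
        log_ratio = math.log2(test / reference)
        if log_ratio <= loss_cutoff:
            calls[name] = "loss"
        elif log_ratio >= gain_cutoff:
            calls[name] = "gain"
        else:
            calls[name] = "normal"
    return calls

# Hypothetical clones: one balanced, one at half signal, one at 1.5x signal.
example = {
    "BAC_1": (1000.0, 1000.0),
    "BAC_2": (500.0, 1000.0),
    "BAC_3": (1500.0, 1000.0),
}
calls = call_copy_number(example)
```

In a diploid sample, a one-copy deletion ideally halves the signal (log2 ratio of -1) and a one-copy duplication raises it by half (about 0.58), so both fall well outside the illustrative cutoffs.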
Deletions and duplications between species may house evolutionary history. Researchers at Perlegen Sciences in Mountain View, Calif., discovered 33 deletions or duplications, ranging from 1 to 8 kilobases, by probing nearly 3,000 pieces of human chromosome 21 and its counterpart, chimp chromosome 22.2 Most of the deletions and duplications are in or near protein-encoding genes, suggesting that copy number differences between the two genomes may account for key differences in gene expression. In addition to limited amino acid sequence divergence, "differences in the timing and amount of expression of genes are what make a human a human, and a chimp a chimp," says Perlegen's Kelly Frazer.
Larger genome pieces yielded similar findings. Evan Eichler of Case Western Reserve University and colleagues used array CGH to examine 40- to 175-kilobase sequences that duplicated or disappeared in recent primate evolution.3 The group scrutinized 2,460 BACs covering 12% of the human genome and discovered 63 sites of copy number variation among humans, bonobos, chimps, gorillas, and orangutans – many near active genes. Some of the large duplications that cause the meiotic mispairing behind some syndromes also played a role in evolution. "Our study found that some of the genomic differences among primates involved events that might have been mediated by these duplicated sequences," says Dan Pinkel of the University of California, San Francisco.
GHOSTS OF GENES PAST
Other evolutionary clues come from pseudogenes, sequences just different enough from those of functional genes that they are not translated into protein, although a few are transcribed. They arise in two ways: chromosomes misalign during cell division and a sequence is copied twice (non-processed), or mRNA is reverse transcribed and plunked down in a chromosome (processed). Processed pseudogenes are like ghosts of once highly expressed genes.
Yale University's Mark Gerstein likens them to relics. "The population of pseudogenes is very different from the population of genes, telling us something about the proteins that were alive in the past, much like fossils tell us about the populations of organisms that were alive in the past."
Once considered oddities in well-studied gene families, these silenced sequences are now the subjects of vast hunts. Gerstein and coworkers looked at gene sequences left in the genome after subtracting the known proteome, overlaps, repeats, fragments of functional genes, and sequences containing more than one exon. They found 8,000 processed pseudogenes.4 Peer Bork's team at the European Molecular Biology Laboratory in Heidelberg identified nearly 20,000 pseudogenes by including some non-processed sequences and using less stringent criteria for divergence of the encoded amino acid sequences.5
Bork says that 20,000 is an underestimate and that pseudogenes may actually outnumber real genes. Uncounted pseudogenes may lie within introns or repeats, or be misclassified as active. Other unrecognized pseudogenes may have diverged so much from their ancestors that they escape the homology screen. "Large portions of the noncoding genome have once been pseudogenes and have been mutated to an extent that we only see sort of random 'junk,'" Bork says. Additionally, identified pseudogenes can be stripped from the collections of protein-encoding genes, further improving the genome's annotation.
HAPLOTYPES AND HOT SPOTS
Courtesy of Mark Gerstein
Genes can take two paths to duplication. Processed duplications are acquired through retropositioning of a spliced mRNA. Non-processed duplications contain intronic information but, like processed pseudogenes, are disabled and degraded by mutation.
Another hallmark of the human genome is its patchwork of haplotype blocks and recombinational hotspots, a phenomenon called linkage disequilibrium (LD). Dampened crossing over preserves gene order within the haplotype blocks, which are defined by single nucleotide polymorphisms (SNPs).
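For readers who want to see what LD looks like numerically, here is a minimal sketch of the standard pairwise statistics D, |D'|, and r² for two biallelic SNPs, computed from phased haplotypes. The allele coding (upper case versus lower case letters) and the sample data are hypothetical, and the studies described here used their own methods and much larger data sets.

```python
from collections import Counter

def pairwise_ld(haplotypes):
    """Return (D, |D'|, r squared) for two biallelic SNPs.

    `haplotypes` is a list of two-character strings, one per chromosome.
    "AB" means allele A at the first SNP and allele B at the second;
    lower case letters stand for the alternative alleles.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_ab = counts["AB"] / n  # frequency of the A-B haplotype
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n

    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b

    # |D'| rescales D by the largest value possible at these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r squared is the squared correlation between the two SNPs.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared

# Hypothetical block: the two SNPs always travel together.
linked = pairwise_ld(["AB"] * 5 + ["ab"] * 5)

# Hypothetical region with free recombination: all four haplotypes equally common.
unlinked = pairwise_ld(["AB", "Ab", "aB", "ab"])
```

Inside a haplotype block, pairs of SNPs tend to give values of r² close to 1, as in the first example; across a recombination hotspot the values drop toward 0, as in the second.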
The landscape view of LD has evolved in step with annotation and SNP discovery. A preliminary report in mid-2002 from the SNP Consortium described "a surprisingly simple pattern: blocks of variable length over which only a few common haplotypes are... punctuated by... recombination."6
A year later, with more data, a more complex topology emerged. Andrew Clark of Cornell University and colleagues analyzed 4,833 SNPs in 538 clusters and found that it isn't an all-or-none phenomenon.7 "Some areas of the genome are 'warmer' or 'cooler' recombinationally, and there is good evidence for a few strong hotspots. But there is also background recombination pretty much throughout the genome," says Leonid Kruglyak, a Howard Hughes Medical Institute associate investigator and a researcher at the University of Washington.
Haplotypes could have arisen in two ways: as areas between hot spots, or as a consequence of rapid population growth that allowed little time for crossing over. "'Blocks' are really defined operationally for mapping studies, and it is unclear whether they correspond to anything deeper biologically. Some block boundaries may represent hotspots, while others may represent random ancient recombination events," says Kruglyak.
Frazer uses a comparison to describe how the echoes of an ancient population bottleneck may ring through our genomes today: "In the 1920s, the population of elephant seals off the coast of California dwindled to about 100 individuals. Due to habitat protection, the population is now in the hundreds of thousands, and all the individuals are genetic clones of those 100." The human version of the seal scenario was the exodus from Africa some 100,000 or so years ago. "Linked blocks probably arose simply out of the demographics of human evolutionary history," she adds.
Probably both an uneven recombinational terrain and population dynamics molded today's LD landscape. Some studies combine genome sequence information with simulations of past recombination rates to reconstruct the story for a particular chromosomal region. For example, Eric Anderson and Montgomery Slatkin at the University of California, Berkeley, estimated that for part of chromosome 5, population growth was the primary force holding certain allele combinations together as haplotypes.8
It will take many such studies, and the HapMap project, to fill in the recombination maps that have been a standard genetic tool since Alfred Sturtevant, as an undergraduate in the famed "fly lab" at Columbia University, used crossover frequencies to chart the first genetic maps in 1911. "We still don't have a really good linkage map showing the recombination points and where the hot spots are," says Frazer.
The flurry of recent papers on deletions and duplications, pseudogenes, and haplotypes reinforces the realization that sequencing the human genome was not an end, but a beginning. Sums up Gerstein, "The human genome is more complex than our expectations might have been from model organisms. The human genome is simply a different beast."
Ricki Lewis