When hobbyist Chobei Zenya wrote The Breeding of Curious Varieties of the Mouse in 1787, he probably never imagined the impact that mouse breeding would eventually have on biomedical and genetic research. During the past two centuries, the "fancy mice" once prized by breeders have evolved into the multitude of inbred strains now used to study complex genetic traits and to model human diseases. In the past 40 years, mouse biology has exploded as scientists added transgenic, knockout, and cloned mice to the heap. This issue's Hot Papers chronicle the next milestone in the life of Mus musculus: its genome.
In December 2002, the Mouse Genome Sequence Consortium (MGSC), led by Eric Lander and Kerstin Lindblad-Toh at what is now the Broad Institute (then the Whitehead Institute/MIT Center for Genome Research) and Robert Waterston, then at the Genome Sequencing Center at Washington University in St....
MOUSE VERSUS HUMAN
The MGSC, comprising researchers from more than 20 institutions in six countries, grew out of the Human Genome Sequencing Consortium. "There's a substantial overlap in the analysis team," explains Waterston, now at the University of Washington in Seattle.
The mouse genome was a quicker, cheaper proposition than the human genome, says Michael Zody, MGSC member and Broad's chief technologist, "because we had invested the time and the money into building ... a technological base and an engineering base and a group of people who knew how to do this work."
The human genome has the most direct implications for human health and biology, says Zody, but comparing it to the mouse uncovers more about genome structure and function. Comparisons revealed multiple deletion/insertion differences and showed that the two mammals share only half the sequence from their last common ancestor. "They're not much different in terms of genes," says Waterston, "but quite different in terms of DNA content." Of the 50% of the genome they have in common, researchers found that only 5% is under selective pressure and less than 2% codes for proteins. "Evolution says that it cares about 5% of the mammalian genome, and only a third of it is protein-coding genes," says Lander, now Broad Institute director.
Lander suggests that the other 3% could encode regulatory or structural information. Such information might include instructions for transcription and splicing regulation, intra- and intergenic interactions, and RNA control mechanisms. David Haussler, director of the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz, says, "There may be more information about gene control or chromatin structure than there is protein-coding information in the mammalian genome, in stark contrast to the genomes of bacteria and the simpler eukaryotes." Haussler, an MGSC member, likens noncoding genomic sequences to computer logic code, which controls timing and interactions between functions in complex systems. "Anybody who's had experience with computer coding knows that most of the lines of code are basically control logic, so it wouldn't be surprising to me to see that true at the genomic level."
THE COUNT OF MUS MUSCULUS
Defining the difference between protein-coding RNAs and those that work at a control level may require a full listing of what's being transcribed. RIKEN's transcriptome analysis was part of their Mouse Gene Encyclopedia Project, an effort to isolate and sequence novel full-length mouse cDNAs. In 2001 RIKEN researchers published FANTOM1, their first set of 21,076 clones; FANTOM2 includes an additional 39,694 cDNAs.
In order to determine an accurate mouse gene count, RIKEN scientists first needed to clarify terminology. "The definition of the word 'gene' is ambiguous," says Hayashizaki. "If two genes are mapped to the same region but don't share any exons, how do we count?" Instead, his team divided the genome into "transcriptional units" (TUs), or segments of DNA from which several transcripts are generated.
Initial analysis of the FANTOM clones and additional public data indicated that the mouse genome contains 37,086 TUs, of which 20,487 are protein coding; but when genomic material not represented by their cDNAs or other public database information was included, the estimate increased to approximately 70,000 TUs. Of the genes examined, 41% were alternatively spliced, which shows that "the number of proteins and the number of transcripts is much greater than the number of genes," says Hayashizaki. The team also found more than 16,000 nonprotein-coding RNAs.
Zody says that having a large set of confirmed transcribed sequences is helpful for doing gene predictions and sorting genes from pseudogenes. "A large set of EST or cDNA data ... is a vital resource for doing a good annotation and a good analysis of a new genome," he says.
Six months after its publication, the FANTOM2 collection had generated enough excitement to garner its own issue of Genome Research in June 2003. Hayashizaki coauthored an article on the DNA Book, a new method for distributing the RIKEN cDNA clones, in which samples are printed on water-soluble paper to be dissolved and amplified by PCR. A FANTOM3 project to expand the RIKEN clone library is already underway.
GREAT EXPECTATIONS
Although the complete genome won't be finished until the end of 2005, scientists have already put the draft sequence to work. Concurrently with the mouse genome draft analysis, researchers from Whitehead/MIT and the Wellcome Trust Sanger Institute analyzed the genomic structure variation between different mouse strains using single nucleotide polymorphisms (SNPs) and found long segments of extremely high and extremely low polymorphism rates.
Now researchers worldwide are studying genetic differences between strains to find loci responsible for disease or involved in complex traits. In January 2004, the National Institutes of Health issued a request for proposals to resequence 15 inbred strains of mice. Novartis Research Foundation scientists constructed a database of 2 million SNPs from 48 strains and have started work on a mouse haplotype project. Mark Daly and colleagues at the Broad Institute expect their mouse haplotype map to be complete within the next two years. Researchers at the Wellcome Trust Centre for Human Genetics recently published a haplotype analysis of eight inbred strains, which indicates that their genomes might not fall into clear haplotype blocks.
Publication of the mouse genome also prompted a host of bioinformatics projects. The Jackson Laboratory's Mouse Genome Informatics project now comprises six smaller database projects, including the Mouse Gene Expression and Mouse Tumor Biology databases. The Mouse Phenome Database project, an international collaboration headquartered at the Jackson Laboratory, aims to define the phenotypic characteristics of the most common mouse strains.
Others view Mus musculus as a contribution to the broader field of comparative genomics. Human geneticist and MGSC member John McPherson, now at Baylor College of Medicine in Houston, says that "Rosetta Stone-style analyses" are still needed. Comparing sequences from mouse, rat, chicken, chimpanzee, and human yields much more information than looking at a single sequence in isolation. Ultimately, he adds, "we still don't speak the language of DNA."
"Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs," Okazaki Y, Nature Biotechnology Vol 4206915, 563-573 Dec. 5, 2002
"Initial sequencing and comparative analysis of the mouse genome," Waterston R.H., Nature Biotechnology Vol 4206915, 520-562 Dec. 5, 2002