"No, no, no," says Bo Yuan, of Ohio State University, having a laugh over the idea that computation is all you need to tally genes. To the contrary, states the director of the bioinformatics group in the division of Human Cancer Genetics at Ohio State, trawling for genes is so labor-intensive that several years may pass before researchers possess a highly accurate count.
Many were surprised earlier this year when published drafts of the human genome chopped tens of thousands of genes off the long-held notion that making a human might require about 100,000 genes. The International Human Genome Sequencing Consortium found evidence for 29,691 human transcripts.1 The commercial genome project of Celera Genomics, of Rockville, Md., found 39,114 genes.2 Now, just months later, those estimates have been revised, effectively doubled: Yuan and his colleagues suggest humans possess between 65,000 and 75,000 genes.3
[Graphic: courtesy of Graham Tyrrell, University of York]
A report in an August issue of Cell4 further indicates how unsettled the human gene count is. Researchers led by Michael Cooke and John Hogenesch at the Genomics Institute of the Novartis Research Foundation (GNF), in San Diego, find that the consortium and Celera together predict a total of about 42,000 genes. But they also find the two groups agree much less than similar gene counts might imply, especially in predictions of novel genes. This analysis is part of a GNF project to assemble a complete set of human genes for systematic study. "We were very interested in comparing the gene sets to get a nonredundant predicted set of genes," says Cooke.
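To make the idea of a "nonredundant predicted set" concrete, here is a minimal sketch in Python. It assumes each predicted gene is reduced to a chromosome and a start/end coordinate, and that two predictions count as the same gene if their coordinates overlap at all. That rule, the `Gene` type, and the toy coordinates are illustrative assumptions, not the method the GNF team describes.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int  # inclusive coordinate (hypothetical toy values)
    end: int    # inclusive coordinate


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two predictions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def compare_gene_sets(set_a, set_b):
    """Split predictions into those shared by both sets and those unique to each."""
    shared_a = [g for g in set_a if any(overlaps(g, h) for h in set_b)]
    only_a = [g for g in set_a if not any(overlaps(g, h) for h in set_b)]
    only_b = [h for h in set_b if not any(overlaps(h, g) for g in set_a)]
    return shared_a, only_a, only_b


# Toy data: three predictions per group, only one of which overlaps.
set_a = [Gene("chr1", 100, 500), Gene("chr1", 900, 1200), Gene("chr2", 50, 300)]
set_b = [Gene("chr1", 450, 700), Gene("chr2", 400, 600), Gene("chr3", 10, 90)]

shared_a, only_a, only_b = compare_gene_sets(set_a, set_b)

# Nonredundant total: everything in set A plus whatever only set B found.
nonredundant_total = len(set_a) + len(only_b)
print(len(shared_a), len(only_a), len(only_b), nonredundant_total)
```

In this toy case the two sets are the same size, yet they agree on a single gene, and the combined nonredundant set is larger than either one alone. This is the pattern the GNF comparison describes.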
Predictions Without Polish
But that is understandable, Cooke says. "The way gene finding normally works is that computers make predictions and humans polish them until they are right." The genome sequencers set aside normal procedure, "because both groups were trying to predict tens of thousands of genes. Polishing wasn't practical."
Lacking that human refinement, different gene-finding methods fetched different results. The consortium used a gene-finding program and then compared its predictions to sequences of known genes--gene finding first, then evidence, Cooke calls it. Celera researchers matched the genome sequence against known transcripts, then used software to define exon boundaries--evidence, then gene finding. Says Cooke: "Those approaches don't sound terribly different, but the results were. We were really surprised to see how few times they agreed."
Part of the explanation is that Celera was aided by its mouse genome sequence, not publicly released. Comparing genomes between species turns up genes conserved during evolution. "The consortium had only partial access to the mouse genome, so they had different evidence," says Cooke, adding that differences in genome assembly, particularly the gaps still to be sequenced, may account for other differences.
Cooke's bioinformatics lesson is that "you cannot do gene finding completely computationally"; present knowledge of the sequence motifs associated with genes falls short of that. "Prediction programs are very good at finding things that look like what we have already seen and quite bad at finding things that are totally new." No criticism intended, he says: "To be fair to the genome people, their real goal was to assemble the sequence. The secondary goal was to start finding the genes, and they made it clear that these were preliminary counts."
A Different Way to Count
[Graphic: courtesy of Bo Yuan]
Yuan's new database sprang from a project last year that drew together the scattered gene expression data for chromosomes 21 and 22. Transcript information is fragmented across numerous databases, Yuan says, making a "unified universal gene index" essential for genome annotation. So, once the genome sequences came out, "we immediately decided to try our methodology on the entire genome."
Yuan's database grew in several steps. First his group assembled mRNA and EST sequences from separate databases into consensus transcripts. Consensus transcripts are preferable to individual ESTs because they provide better evidence for splicing and protein homology and align more convincingly to the genome. Then they added about 1,400 genes by comparing human sequences against rat and mouse transcript data. The most valuable rodent transcripts represented gene expression in early embryogenesis and central nervous system tissues, where human transcript data is harder to come by. Another 3,100 genes were added based on homology to amino acid sequences in protein databases. Finally, after removing redundant transcripts, they located exons uniquely on the genome using BLAST. Some transcripts, however, were not located. Many probably reflect gaps remaining in the genome drafts; others left for future placement are those with potential for several locations due to sequence homologies.
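The final placement step can be sketched roughly as follows. This Python example assumes alignment hits have already been produced by a tool such as BLAST and reduced to simple tuples. The tuple layout, the 95 percent identity cutoff, and the function name `classify_placements` are hypothetical choices for illustration, not details reported by Yuan's group.

```python
from collections import defaultdict


def classify_placements(transcript_ids, hits, min_identity=95.0):
    """Sort transcripts into unique, ambiguous, and unplaced groups.

    hits: iterable of (transcript_id, chrom, start, percent_identity).
    min_identity is an assumed cutoff, not a published parameter.
    """
    loci = defaultdict(set)
    for tid, chrom, start, identity in hits:
        if identity >= min_identity:
            loci[tid].add((chrom, start))

    result = {"unique": [], "ambiguous": [], "unplaced": []}
    for tid in transcript_ids:
        n_loci = len(loci.get(tid, ()))
        if n_loci == 1:
            result["unique"].append(tid)     # one confident location
        elif n_loci > 1:
            result["ambiguous"].append(tid)  # several candidate locations
        else:
            result["unplaced"].append(tid)   # no confident hit, e.g. a gap
    return result


# Toy data: T1 maps once, T2 maps to two loci, T3 has only a weak hit.
transcripts = ["T1", "T2", "T3"]
toy_hits = [
    ("T1", "chr21", 1000, 99.1),
    ("T2", "chr22", 500, 98.0),
    ("T2", "chr7", 8000, 97.5),
    ("T3", "chr1", 42, 80.0),
]

placements = classify_placements(transcripts, toy_hits)
print(placements)
```

The three buckets mirror the outcomes in the text: transcripts located uniquely, transcripts held back because homology gives them several possible locations, and transcripts that could not be located, perhaps because they fall in gaps in the draft.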
The new database placed 75,982 genes on the genome, 66,610 of which have evidence of multiple transcripts or mRNA splicing. "We find many novel exons with clear splicing evidence, on the order of twice as many as described in the other reports," Yuan says.
Confirming that predictions are real genes, known as validation, is a major reason the gene count will remain open for a while. "A prediction is just a prediction," says Cooke. "You have to validate the prediction experimentally before you can call it a gene." To do that, he looks to gene chips. His team is designing chips containing all genes predicted by Celera and the consortium. The chips will find out how many of the predicted novel genes are expressed anywhere in the body. A second validation strategy is the GNF project to assemble a complete set of human genes to run through batteries of functional assays. This is good news for gene cloners. The arrival of the genome sequence might suggest that cloners are a dying breed, but that's not Cooke's view. With only about 15,000 unique genes in GNF's kit, cloning remains a hot skill to have. "There's a tremendous need for cloners."
Yuan's group approaches validation through comparative genomics, comparing human and animal sequences to pinpoint candidate exons conserved during evolution. "We are currently using a mouse genome we recently assembled to annotate the human genome," he says.
Using transcript and protein data to revise the gene count won't diminish the importance of strictly computational gene-finding methods. Cooke, in fact, recommends liberal application of algorithmic gene detection. A present difficulty is that "we don't have a good estimate for how many genes were missed in [the genome groups'] prediction efforts," Cooke says. "If you want every gene you predict to be real, then you're going to miss a lot of stuff. If you want to be complete, then you're probably going to predict a lot of genes that aren't there." It's the old clash of signal and noise. "How you walk that line determines how many genes you predict. But at the end of the day, I think we're probably better off over-predicting and then using other tools to validate."
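The trade-off Cooke describes is the familiar one between precision and recall. As a minimal sketch, assume each gene prediction carries a confidence score and that we somehow know which predictions are real. The scores and labels below are invented toy values, and `precision_recall` is a hypothetical helper, not part of any gene-finding package.

```python
def precision_recall(predictions, threshold):
    """Compute precision and recall for calls at or above a score threshold.

    predictions: list of (score, is_real_gene) pairs.
    """
    called = [real for score, real in predictions if score >= threshold]
    total_real = sum(1 for _, real in predictions if real)
    true_positives = sum(called)
    precision = true_positives / len(called) if called else 1.0
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall


# Toy predictions: (confidence score, whether validation confirmed the gene).
toy_predictions = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.55, False), (0.40, True), (0.30, False), (0.20, False),
]

strict = precision_recall(toy_predictions, 0.85)   # call only confident genes
lenient = precision_recall(toy_predictions, 0.35)  # over-predict, validate later
print(strict, lenient)
```

With the strict threshold every call is real but half the true genes are missed. With the lenient threshold every true gene is found, at the cost of false calls that validation would need to weed out afterward.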
1. E.W. Lander et al., "Initial sequencing and analysis of the human genome," Nature, 409:860-921, 2001.
2. J.C. Venter et al., "The sequence of the human genome," Science, 291:1304-51, 2001.
3. F.A. Wright et al., "A draft annotation and overview of the human genome," Genome Biology, 2:1-18, 2001.
4. J.B. Hogenesch et al., "A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes," Cell, 106:413-5, Aug. 24, 2001.