Human Genes: How Many?

By Tom Hollon | October 15, 2001

Counting human genes ought to be straightforward. Tracking telltale signs--motifs for promoters, translation start sites, splice sites, CpG islands--gene counters must by now be mopping up, finalizing chromosomal locations of every human gene already known, and predicting whereabouts of all the rest. Insert one human genome sequence, turn the bioinformatics crank, and genes gush out like a slot machine jackpot, right?

"No, no, no," says Bo Yuan, of Ohio State University, having a laugh over the idea that computation is all you need to tally genes. To the contrary, states the director of the bioinformatics group in the division of Human Cancer Genetics at Ohio State, trawling for genes is so labor-intensive that several years may pass before researchers possess a highly accurate count.

Many were surprised earlier this year when published drafts of the human genome chopped tens of thousands of genes off the long-held notion that making a human might require about 100,000 genes. The International Human Genome Sequencing Consortium found evidence for 29,691 human transcripts [1]. The commercial genome project of Celera Genomics, of Rockville, Md., found 39,114 genes [2]. Now, just months later, those estimates have been revised, effectively doubled: Yuan and his colleagues suggest humans possess between 65,000 and 75,000 genes [3].

Graphic Courtesy of Graham Tyrrell, University of York

A report in an August issue of Cell [4] further indicates how unsettled the human gene count is. Researchers led by Michael Cooke and John Hogenesch at the Genomics Institute of the Novartis Research Foundation (GNF), in San Diego, find that the consortium and Celera together predict a total of about 42,000 genes. But they also find the two groups agree much less than similar gene counts might imply, especially in predictions of novel genes. This analysis is part of a GNF project to assemble a complete set of human genes for systematic study. "We were very interested in comparing the gene sets to get a nonredundant predicted set of genes," says Cooke.

Predictions Without Polish

Using the sequence alignment algorithm BLAST, the GNF team first compared the Celera and consortium gene sets with transcripts of genes listed in RefSeq, a well-known curated gene bank. The two groups did "a fairly good job" of finding known genes, agreeing on about 9,300 genes, 85 percent of RefSeq's total. Novel genes were another story, almost like mixed-up magicians pulling Martians and marshmallows out of the same hat: Although Celera and the consortium together predicted 31,098 new genes, only one gene in five had a place on both lists.
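At bottom, that comparison is a question of set overlap. Here is a minimal sketch in Python, assuming each predicted gene has already been reduced to a comparable identifier. In the GNF analysis that matching was done by BLAST alignment, which is not reproduced here. The function name and gene identifiers are hypothetical.

```python
def overlap_report(set_a, set_b):
    """Summarize agreement between two sets of predicted gene identifiers."""
    shared = set_a & set_b
    combined = set_a | set_b
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "combined": len(combined),
        # Share of the combined list that appears in both sets.
        "shared_fraction": len(shared) / len(combined) if combined else 0.0,
    }

# Toy identifiers, invented for illustration only.
celera_novel = {"g1", "g2", "g3", "g4"}
consortium_novel = {"g4", "g5", "g6"}
report = overlap_report(celera_novel, consortium_novel)
```

If the one-in-five figure is read as a share of the combined list, the article's 31,098 novel predictions would leave roughly 6,200 genes on both lists.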

But that is understandable, Cooke says. "The way gene finding normally works is that computers make predictions and humans polish them until they are right." The genome sequencers set aside normal procedure, "because both groups were trying to predict tens of thousands of genes. Polishing wasn't practical."

Lacking that human refinement, different gene-finding methods fetched different results. The consortium used a gene-finding program and then compared its predictions to sequences of known genes--gene finding first, then evidence, Cooke calls it. Celera researchers matched the genome sequence against known transcripts, then used software to define exon boundaries--evidence, then gene finding. Says Cooke: "Those approaches don't sound terribly different, but the results were. We were really surprised to see how few times they agreed."

Part of the explanation is that Celera was aided by its mouse genome sequence, not publicly released. Comparing genomes between species turns up genes conserved during evolution. "The consortium had only partial access to the mouse genome, so they had different evidence," says Cooke, adding that differences in genome assembly, particularly the gaps still to be sequenced, may account for other differences.

Cooke's bioinformatics lesson is that "you cannot do gene finding completely computationally"; present knowledge of sequence motifs associated with genes falls short of that. "Prediction programs are very good at finding things that look like what we have already seen and quite bad at finding things that are totally new." No criticism intended, he says: "To be fair to the genome people, their real goal was to assemble the sequence. The secondary goal was to start finding the genes, and they made it clear that these were preliminary counts."

A Different Way to Count

Bo Yuan (photo courtesy of Bo Yuan)

Yuan's group counted genes by relying less on gene-finding algorithms and more on evidence of gene expression. Where Celera and the consortium used only one or two gene expression databases to aid their counts, Yuan's group used 13--every scrap of public information on human gene expression. ("We didn't want to miss anything.") From those databases the Ohio researchers consolidated cDNA and EST (Expressed Sequence Tag) transcripts and protein sequences into an annotated human gene index called the OSU Human Genome Database. A commercial browser incorporating the index is available from LabBook, Inc., of McLean, Va.

The new database sprang from a project last year drawing together the scattered gene expression data for chromosomes 21 and 22. Transcript information has fragmented across numerous databases, Yuan says, making a "unified universal gene index" essential for genome annotation. So, once the genome sequences came out, "we immediately decided to try our methodology on the entire genome."

Yuan's database grew in several steps. First his group assembled mRNA and EST sequences from separate databases into consensus transcripts. Consensus transcripts are preferable to individual ESTs because they provide better evidence for splicing and protein homology and align more convincingly to the genome. Then they added about 1,400 genes by comparing human sequences against rat and mouse transcript data. The most valuable rodent transcripts represented gene expression in early embryogenesis and central nervous system tissues, where human transcript data is harder to come by. Another 3,100 genes were added based on homology to amino acid sequences in protein databases. Finally, after removing redundant transcripts, they located exons uniquely on the genome using BLAST. Some transcripts, however, were not located. Many probably reflect gaps remaining in the genome drafts; others left for future placement are those with potential for several locations due to sequence homologies.
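The consolidation step groups sequences from separate databases that represent the same transcript, and it can be pictured as connected-component clustering. Here is a minimal sketch in Python, assuming pairwise matches between sequences have already been computed by an alignment tool. The identifiers, the `cluster_transcripts` name, and the single-linkage rule are illustrative assumptions, not the Ohio State group's published procedure.

```python
def cluster_transcripts(ids, matches):
    """Group sequence IDs into clusters, given pairs judged to match."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their root and return clusters in a stable order.
    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Toy data: three ESTs and two mRNAs, with hypothetical match pairs.
ids = ["est1", "est2", "mrna1", "est3", "mrna2"]
matches = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "mrna2")]
clusters = cluster_transcripts(ids, matches)
```

Here `est1` and `est2` never match each other directly, yet both land in the same cluster through `mrna1`. That is the sense in which a consensus transcript gathers more evidence than any single EST.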

The new database placed 75,982 genes on the genome, 66,610 of which have evidence of multiple transcripts or mRNA splicing. "We find many novel exons with clear splicing evidence, on the order of twice as many as described in the other reports," Yuan says.


Yuan avoids calling the index entries genes, preferring to call them transcript clusters, a careful term referring to how cDNAs and ESTs from different databases are grouped together based on homology. "They should be genes, but we don't have the evidence yet," he says. "We still have to confirm that all those transcripts and ESTs that align with the genome are functional." He notes that there could be some false positives, pseudogenes for instance. Pseudogenes are hard to identify computationally and the extent of their transcription is unknown. Other errors may come from confusing transcript orientation, so that the same transcripts collected from different databases are mistaken for different genes.
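The orientation problem can be illustrated with a small sketch. A cDNA recorded from the opposite strand is the reverse complement of the same transcript, so comparing raw strings counts one transcript twice. The Python below, with hypothetical function names and toy sequences, maps each sequence to a strand-independent key before counting. Exact string matching is a simplifying assumption; real ESTs are partial and error-prone and would need alignment.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def strand_neutral_key(seq):
    """Pick one fixed representative of a sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))

def count_distinct(seqs):
    """Count sequences, treating opposite-strand copies as the same entry."""
    return len({strand_neutral_key(s) for s in seqs})

# Toy sequences, invented for illustration only.
forward = "ATGGCCTTA"
flipped = reverse_complement(forward)
distinct = count_distinct([forward, flipped, "ATGAAACCC"])
```

A naive count of the three strings above would report three entries. With the strand-neutral key, the first two collapse into one.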

Confirming that predictions are real genes, known as validation, is a major reason the gene count will remain open for a while. "A prediction is just a prediction," says Cooke. "You have to validate the prediction experimentally before you can call it a gene." To do that, he looks to gene chips. His team is designing chips containing all genes predicted by Celera and the consortium. The chips will find out how many of the predicted novel genes are expressed anywhere in the body. A second validation strategy is the GNF project to assemble a complete set of human genes to run through batteries of functional assays. This is good news for gene cloners. The arrival of the genome sequence might suggest that cloners are a dying breed, but that's not Cooke's view. With only about 15,000 unique genes in GNF's kit, cloning remains a hot skill to have. "There's a tremendous need for cloners."

Yuan's group approaches validation through comparative genomics, comparing human and animal sequences to pinpoint candidate exons conserved during evolution. "We are currently using a mouse genome we recently assembled to annotate the human genome," he says.

Using transcript and protein data to revise the gene count won't diminish the importance of strictly computational gene-finding methods. Cooke, in fact, recommends liberal application of algorithmic gene detection. A present difficulty is that "we don't have a good estimate for how many genes were missed in [the genome groups'] prediction efforts," Cooke says. "If you want every gene you predict to be real, then you're going to miss a lot of stuff. If you want to be complete, then you're probably going to predict a lot of genes that aren't there." It's the old clash of signal and noise. "How you walk that line determines how many genes you predict. But at the end of the day, I think we're probably better off over-predicting and then using other tools to validate."
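Cooke's trade-off is the familiar one between precision and recall. Here is a minimal sketch in Python, assuming each candidate gene carries a confidence score and that a validated set of real genes exists to check against. The scores, names, and thresholds are invented for illustration.

```python
def precision_recall(scored, real, threshold):
    """Return (precision, recall) for candidates scoring at or above threshold."""
    predicted = {name for name, score in scored.items() if score >= threshold}
    true_pos = len(predicted & real)
    # Precision: share of predictions that are real genes.
    precision = true_pos / len(predicted) if predicted else 1.0
    # Recall: share of real genes that were predicted.
    recall = true_pos / len(real) if real else 1.0
    return precision, recall

# Hypothetical candidate scores and a hypothetical validated gene set.
scored = {"a": 0.95, "b": 0.90, "c": 0.60, "d": 0.55, "e": 0.30, "f": 0.20}
real = {"a", "b", "c", "e"}

strict = precision_recall(scored, real, 0.8)
lenient = precision_recall(scored, real, 0.25)
```

In this toy case the strict threshold predicts only real genes but misses half of them. The lenient one finds every real gene at the cost of one false prediction, which is the over-predict-then-validate stance Cooke favors.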

Tom Hollon is a freelance writer in Rockville, Md.

1. E.W. Lander et al., "Initial sequencing and analysis of the human genome," Nature, 409:860-921, 2001.

2. J.C. Venter et al., "The sequence of the human genome," Science, 291:1304-51, 2001.

3. F.A. Wright et al., "A draft annotation and overview of the human genome," Genome Biology, 2:1-18, 2001.

4. J.B. Hogenesch et al., "A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes," Cell, 106:413-5, Aug. 24, 2001.
