How Many Genes Do Humans Have?
Researchers disagree on the number of genes in the human genome, in part because it can be difficult to determine exactly what a gene is.
To determine how many genes are in the human genome, researchers first needed to produce a complete genomic sequence. The Human Genome Project (HGP), which ran from 1990 to 2003, was a publicly funded initiative led by an international consortium of researchers to comprehensively study DNA and generate the first human genome sequence. These researchers were not alone in this endeavor: in 2001, the Human Genome Project team and scientists from the private company Celera Genomics each published their own nearly complete draft sequence.1,2 In 2004, Human Genome Project researchers published the full human genome and estimated that it contained between 20,000 and 25,000 genes.3 This number was much smaller than earlier estimates, which suggested anywhere from 50,000 to 100,000 genes.4 Scientists now estimate that the number of protein-coding genes is around 20,000,5 but that number still fluctuates because the definition of a gene is open to interpretation.
What Is a Gene?
A gene is a unit of inheritance, passed from parent to offspring, made up of DNA. Depending on the definition, a gene may be considered a stretch of DNA that acts as instructions for protein production. Regions of DNA that do not code for proteins may also be considered genes if they produce noncoding RNA with biological functions.
Protein-coding versus noncoding genes
When estimating the number of genes in the human genome, Human Genome Project researchers initially counted only protein-coding genes, which are regions of chromosomal DNA that are transcribed into RNA and translated into proteins. “On top of the 20,000 protein-coding genes, we have another 15,000 or 20,000 noncoding genes,” said Steven Salzberg, a computational biologist at Johns Hopkins University. Noncoding RNA (ncRNA) genes are transcribed but are not translated.
A stricter gene definition may decrease the number of noncoding RNA genes included in the human gene count: under this definition, a gene is any section of the genome that produces a functional RNA or that is transcribed and translated into protein. According to Salzberg, scientists know the functions of fewer than five percent of the thousands of noncoding RNAs that have been sequenced. “[Some] might just be noise,” said Salzberg. “We have to drop this assumption that something being transcribed means that it's a functional gene. We have to do something else to figure out that it's a gene.”
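To make the effect of the definition concrete, the following minimal Python sketch counts "genes" in a small set of toy loci under three definitions: protein-coding only, any transcribed region, and functional RNA or protein. The loci, their names, and the `Locus` fields are all hypothetical and invented for illustration; they do not come from any real annotation or database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    name: str             # hypothetical label, not a real gene symbol
    transcribed: bool     # is the region transcribed into RNA?
    translated: bool      # is the RNA translated into protein?
    known_function: bool  # is a biological function known for the product?


# Toy loci invented for illustration only.
LOCI = [
    Locus("locus_A", transcribed=True, translated=True, known_function=True),
    Locus("locus_B", transcribed=True, translated=True, known_function=True),
    Locus("locus_C", transcribed=True, translated=False, known_function=True),
    Locus("locus_D", transcribed=True, translated=False, known_function=False),
    Locus("locus_E", transcribed=True, translated=False, known_function=False),
    Locus("locus_F", transcribed=False, translated=False, known_function=False),
]


def is_protein_coding(locus: Locus) -> bool:
    """Definition 1: transcribed into RNA and translated into protein."""
    return locus.transcribed and locus.translated


def is_transcribed(locus: Locus) -> bool:
    """Definition 2: any transcribed region counts, functional or not."""
    return locus.transcribed


def is_functional(locus: Locus) -> bool:
    """Definition 3: protein-coding, or noncoding RNA with a known function."""
    return is_protein_coding(locus) or (locus.transcribed and locus.known_function)


def count_genes(loci, definition) -> int:
    """Count the loci that qualify as genes under the given definition."""
    return sum(1 for locus in loci if definition(locus))


counts = {
    "protein-coding only": count_genes(LOCI, is_protein_coding),
    "any transcribed region": count_genes(LOCI, is_transcribed),
    "functional RNA or protein": count_genes(LOCI, is_functional),
}

for label, n in counts.items():
    print(f"{label}: {n}")
```

The same six loci yield three different gene counts, which is the point Salzberg makes: transcription alone is a more permissive criterion than demonstrated function.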
There are numerous types of noncoding RNA, including the following.6
- Transfer RNA (tRNA)
- Ribosomal RNA (rRNA)
- MicroRNA (miRNA)
- Small interfering RNA (siRNA)
- PIWI-interacting RNA (piRNA)
- Small nucleolar RNA (snoRNA)
- Small nuclear RNA (snRNA)
- Long noncoding RNA (lncRNA)
- Enhancer RNA (eRNA)
Splice variants
Alternative splicing affects a majority of human genes and can produce numerous isoforms that may or may not have significant biological functions, further complicating the gene estimate.7 “In theory, we can make many more proteins from our 20,000 genes,” Salzberg said. “The number of different protein sequences we have ranges from about 80,000 to 120,000, and we're still exploring how many of those are really functional.”
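As a rough back-of-the-envelope check, the figures quoted above imply an average number of distinct protein sequences per protein-coding gene. The short Python sketch below does that division; it assumes the approximate numbers from the text and says nothing about how isoforms are actually distributed across individual genes.

```python
# Approximate figures quoted in the text.
PROTEIN_CODING_GENES = 20_000
PROTEIN_SEQUENCES_LOW = 80_000
PROTEIN_SEQUENCES_HIGH = 120_000


def sequences_per_gene(total_sequences: int, genes: int) -> float:
    """Average number of distinct protein sequences per gene."""
    return total_sequences / genes


low = sequences_per_gene(PROTEIN_SEQUENCES_LOW, PROTEIN_CODING_GENES)
high = sequences_per_gene(PROTEIN_SEQUENCES_HIGH, PROTEIN_CODING_GENES)

print(f"Average protein sequences per gene: {low:.0f} to {high:.0f}")
```

This is only an average: some genes may produce a single protein sequence while others produce many.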
Gene databases
To explore lists of known genes, researchers typically turn to two major gene databases: RefSeq, which is maintained by the National Center for Biotechnology Information (NCBI), and Ensembl/Gencode, which is maintained by the European Molecular Biology Laboratory (EMBL).4 Scientists have also created alternatives to these two lists, such as CHESS, a human gene catalog developed by Salzberg and his colleagues at Johns Hopkins University that added some new genes and more than 100,000 new gene isoforms to existing databases upon its release.8 These gene collections are constantly updated as new data become available, and they do not agree in their counts of protein-coding genes, noncoding RNA genes, other RNAs, and pseudogenes.4
Refining the Human Genome Sequence
The Telomere-to-Telomere Consortium
After the Human Genome Project was completed, having a sequence at hand allowed researchers to identify genes along with other genomic elements. However, the “complete” sequence had hundreds of gaps, mostly consisting of highly repetitive DNA that was challenging to sequence. Over time and with advances in sequencing technology, scientists in the Telomere-to-Telomere (T2T) Consortium traversed the final eight percent of the human genome, sequencing repetitive heterochromatic regions, areas near the centromeres and telomeres, and remaining gene-encoding euchromatic regions. They published the first fully complete sequence of a female genome in 2022, followed by the Y chromosome in 2023, and identified new genes in each dataset.9,10
Future refinement
While scientists still do not know the exact number of genes in the human genome, Salzberg is optimistic that new technologies will help refine the gene catalog. This is particularly important for medical purposes. “I've worked with some pediatric geneticists who are trying to figure out why children have particular genetic diseases. If there's a gene that's not annotated, it's not known. They won't look at it,” said Salzberg. “We'd like to be able to reassure them [that] all the genes are known.”
1. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921.
2. Venter JC, et al. The sequence of the human genome. Science. 2001;291(5507):1304-1351.
3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011):931-945.
4. Salzberg SL. Open questions: How many genes do we have? BMC Biology. 2018;16(1):94.
5. Amaral P, et al. The status of the human gene catalogue. Nature. 2023;622(7981):41-47.
6. Zhang P, et al. Non-coding RNA and their integrated networks. J Integr Bioinform. 2019;16(3):20190027.
7. Wang ET, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470-476.
8. Pertea M, et al. CHESS: A new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 2018;19(1):208.
9. Nurk S, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53.
10. Rhie A, et al. The complete sequence of a human Y chromosome. Nature. 2023;621(7978):344-354.