New Database Expands Number of Estimated Human Protein-Coding Genes

Some scientists are not yet convinced that the list is accurate.

Jun 19, 2018
Diana Kwon


The human genome may contain more protein-coding genes than prior analyses suggested. A study published last month (May 29) on BioRxiv presents a database that includes approximately 5,000 novel genes, around 1,000 of which code for proteins, raising the estimated number of protein-coding genes from around 20,000 to 21,000.

“If people like our gene list, then maybe a couple years from now we’ll be the arbiter of human genes,” study coauthor Steven Salzberg, a computational biologist at Johns Hopkins University, tells Nature.

Salzberg and his colleagues compiled a catalog of human genes and transcripts using data from the Genotype-Tissue Expression (GTEx) project, in which scientists sequenced the RNA from various tissues in hundreds of human subjects. By comparing the sequenced RNA to the human genome, the researchers were able to compile a database of 43,162 genes, of which 21,306 code for proteins and 21,856 are noncoding.

According to Nature, this dataset includes many more genes than existing datasets. For example, the GENCODE gene set, a widely used human gene database run by the European Bioinformatics Institute (EBI) in the U.K., includes 19,901 protein-coding genes and 15,779 noncoding ones.

Some scientists say more evidence is required to verify that the new gene list is accurate. For example, Adam Frankish, a computational biologist at the EBI who works on the GENCODE project and was not involved in the study, tells Nature that after carefully analyzing about 100 of the newly identified protein-coding genes, he and his colleagues found that only one of those seems to truly code for protein.

Salzberg tells Nature that having an accurate gene count is important, because uncounted genes are frequently ignored—meaning those containing disease-causing mutations may be overlooked. On the other hand, Frankish tells Nature that hastily adding genes could also be problematic, because they may divert scientists’ attention away from the genes that are actually involved in a disease.