The human reference genome is a DNA blueprint used as a standard for comparison in basic research and clinical settings. Despite improvements in accuracy and completeness that have been made over the years, it still harbors limitations that can result in erroneous findings.
In the current version of the reference, called GRCh38 or Build 38, 93 percent of the sequence comes from just 11 individuals and 70 percent from just one man, resulting in a lack of diversity and at least 300 million missing letters of DNA. In addition, a small percentage of the genes in the reference genome are represented by alleles that are not the most common forms of the genes.
To address these issues, some scientists are developing a new reference, called the pangenome or graph genome, that contains a vast collection of genomes representing all possible DNA sequences for any given locus. But representing these data—the 3 billion bases in one person, times the hundreds to thousands of individuals whom scientists seek to include—is extremely complicated.
The trouble with a pangenome is that incorporating it into existing research practices and software would be a huge undertaking, because it must be represented as a graph rather than as a single, linear genome. For instance, the methods used for transcriptomics, which can tell scientists which genes are active in a particular cell, would need a complete overhaul.
“Most of the methods that do transcript expression analysis, they work on, or they expect as input, a single sequence like a single reference genome. They don’t expect a graph,” says Christina Boucher, a bioinformatics researcher at the University of Florida. “That’s a huge jump in the input. So the methods that actually do the transcript expression would have to be redeveloped, then, in order to take in a graph rather than a single reference. The algorithms in and of themselves would have to be redeveloped.”
That’s why researchers such as Jesse Gillis, a computational biologist at Cold Spring Harbor Laboratory, came up with a new idea: the “consensus genome.” It is still a single genome, just like the current reference, but it represents the most common alleles among thousands of individuals rather than whatever the few individuals used to make the current reference happened to have in their DNA. Because it keeps the same linear format, existing genome analysis software can adopt it almost painlessly, says Gillis.
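The idea of taking the most common allele at each position can be sketched in a few lines of Python. This is only an illustration, not the team's software: the function name `build_consensus`, the toy sequence, and the allele counts are all made up, and the sketch handles single-base substitutions only, ignoring insertions and deletions.

```python
from collections import Counter

def build_consensus(reference: str, allele_counts: dict[int, Counter]) -> str:
    """Return a copy of `reference` in which each listed position carries
    the most common allele observed across the sampled genomes.

    `allele_counts` maps a 0-based position to a Counter of single-base
    alleles (hypothetical input format, substitutions only).
    """
    bases = list(reference)
    for position, counts in allele_counts.items():
        major_allele, _ = counts.most_common(1)[0]
        bases[position] = major_allele
    return "".join(bases)

# Toy data (invented): at position 2 the reference carries the minor allele G.
reference = "ACGTACGT"
allele_counts = {
    2: Counter({"G": 30, "T": 70}),  # T is more common, so it replaces G
    5: Counter({"C": 90, "A": 10}),  # reference already has the major allele
}

consensus = build_consensus(reference, allele_counts)
print(consensus)  # ACTTACGT
```

The output is still one linear sequence of the same length as the input, which is why tools built for a single reference could read it without modification.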
In a preprint posted to bioRxiv on December 22, Gillis and his colleagues, including Alexander Dobin of Cold Spring Harbor Laboratory, who developed the popular RNA sequence analysis software STAR, compare their consensus genome to the current reference genome, as well as to population-specific consensus genomes they created representing both superpopulations such as East Asian and subpopulations such as Han Chinese in Beijing.
They created consensus genomes using the 1000 Genomes Project database, which contains more than 2,500 genomes across 26 subpopulations, grouped into five superpopulations. They tested how GRCh38 and each consensus genome performed in transcriptomic analysis with STAR, to see whether improving the input reference genome would improve gene expression analysis.
As in DNA analysis, the data produced during RNA sequencing come in pieces called reads. To determine where these pieces come from in the genome, researchers often match the reads to a reference genome, a process known as mapping or alignment. Then they can count up how much messenger RNA there is for each gene to quantify gene activity.
As a baseline, Gillis and his colleagues first aligned the reads from one individual to his or her own genome and measured gene expression. They then did the same using the reference and consensus genomes and compared the results to the baseline, quantifying the differences, or amount of error, between them.
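One simple way to picture this comparison is to flag every gene whose read count departs from the baseline by more than some tolerance. The Python sketch below does that. The 5 percent relative tolerance, the function name, and all of the counts are assumptions made for illustration; they are not the preprint's metric or data.

```python
def genes_with_error(baseline: dict[str, int], test: dict[str, int],
                     tolerance: float = 0.05) -> list[str]:
    """List genes whose count in `test` differs from `baseline` by more than
    `tolerance`, measured relative to the baseline count.

    The relative-difference rule and the default tolerance are illustrative
    assumptions, not the measure used in the study.
    """
    flagged = []
    for gene, baseline_count in baseline.items():
        test_count = test.get(gene, 0)
        denominator = max(baseline_count, 1)  # avoid dividing by zero
        if abs(test_count - baseline_count) / denominator > tolerance:
            flagged.append(gene)
    return flagged

# Invented counts: reads aligned to the person's own genome form the baseline.
baseline = {"geneA": 100, "geneB": 50, "geneC": 200}
with_reference = {"geneA": 90, "geneB": 50, "geneC": 180}
with_consensus = {"geneA": 98, "geneB": 50, "geneC": 199}

reference_errors = genes_with_error(baseline, with_reference)
consensus_errors = genes_with_error(baseline, with_consensus)
print(reference_errors)  # ['geneA', 'geneC']
print(consensus_errors)  # []
```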
According to Gillis, the inaccuracies that the reference genome produces during alignment and gene expression measurement are minor, but the team found that the consensus genomes produced even fewer. Specifically, compared with the reference genome, the consensus genomes reduced the mapping error rate from around 9 percent to around 4 percent. And because errors in mapping lead to errors when counting up messenger RNA, the reference also generated mistakes in gene expression measurement in almost six times as many genes as the consensus did.
“While absolutely a decrease of a factor of two to three sounds like an impressive difference, in reality, it’s going from what I would say is exceptionally good to slightly beyond exceptionally good,” says Gillis. “And that should be a relief because we’ve been doing science for a long time using the reference. If we discovered that this was a life changing difference, it would be worrying.”
Gillis and his team also found that the population-specific genomes reduced errors only marginally beyond the general consensus, by a maximum of about 1 percent. This suggests that having dedicated references for each population may be unnecessary for RNA sequencing analysis.
This is good news for Elizabeth Atkinson of Massachusetts General Hospital and the Broad Institute of MIT and Harvard, who studies admixed populations, those whose recent ancestry comes from multiple sources. She says that not only would population-specific genomes make it difficult to compare individuals with multiple ancestries to each other, but just assigning people to those groups is challenging.
“If you have someone of mixed ancestry, which ancestry do you choose for their population consensus genome?” says Atkinson. “The population is getting more mixed over time, so it makes sense to me that, if the pan-species [consensus] option appears to perform effectively as well [as the population-specific consensus], that would get around some of those wrinkles in terms of comparing across populations and deciding how to even assign people to their correct population.”
Although Gillis says he believes that other researchers would be able to replicate these consensus genomes fairly quickly, he and his colleagues have developed software that lets others build their own consensus genomes and use them for RNA sequencing analysis. The programs are free, open-source, and available on GitHub.