It’s hard to find a word in the dictionary if some pages are missing. Similarly, it’s hard to study genetic sequences if they’re absent from the human reference genome, the product of the $2.7 billion Human Genome Project, which is typically used as a guide for genomic studies.
A new study has identified more than 61,000 novel genetic sequences across 1,000 Swedish genomes that are absent from the human reference genome. Many of these sequences were also found in African and Icelandic genomes, and even the chimpanzee genome, suggesting they are ancient. The findings, published last week (September 24) in Molecular Biology and Evolution, highlight the diversity of human DNA and underscore the need for an improved reference genome that’s more representative of human genetic variation.
“It’s part of a family of papers that make relatively similar points,” remarks Jesse Gillis, a computational biologist at Cold Spring Harbor Laboratory who wasn’t involved in the research. “It’s about the reference [genome] not being reflective of what is very common in the human population.”
Anna Lindstrand, a clinical geneticist at the Karolinska Institute and the senior author of the new research, is well acquainted with the reference genome’s poor representation of Swedes. Her diagnostic lab at the Karolinska University Hospital often performs genetic screens on patients to find disease-causing mutations. To do that, they sequence the patients’ DNA and align it with the human reference genome—considered to be a “normal” genome, she says—and compare changes relative to it.
However, most of the reference genome stems from a single individual. In addition, the genome may have gaps because the methods used to assemble it could have missed some hard-to-catch DNA segments. If a patient has a particular genetic mutation that can’t be found in the reference genome, that would suggest the mutation is unusual—but it may in fact be quite common across many individuals, Lindstrand explains.
To get a better idea of how much genetic variation in the Swedish population is captured by the reference genome, Lindstrand and her colleagues sequenced the genomes of 1,000 people from across Sweden. They then used a computational pipeline built by Lindstrand’s graduate student Jesper Eisfeldt to assemble these genomes from scratch, rather than by aligning them to the human reference genome.
In comparing each newly assembled genome to the reference genome, they found that the Swedish ones contained 1.8 megabases of genetic material that could not be mapped to GRCh37, a 2009 version of the human reference genome that is often used by clinicians. Nearly 40 percent of that genetic material also couldn’t be matched to GRCh38, a newer version of the human reference genome.
In total, across the 1,000 newly assembled genomes, the researchers counted 61,044 sequences (enough DNA to fill up chromosome 21) that were absent from either reference genome, making them “novel sequences.” Some of the novel sequences were common, but most were relatively rare across the study population, which Lindstrand finds fascinating. “Even though we humans are so similar, there’s also so much diversity,” she remarks.
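The idea of flagging assembled sequence that cannot be matched to a reference can be illustrated with a toy sketch. This is not the study's pipeline, which assembled genomes and mapped them against GRCh37 and GRCh38. It swaps in a much simpler technique, comparing short substrings (k-mers), and every name, sequence, and threshold in it is a made-up assumption for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a DNA string."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def fraction_in_reference(contig, reference_kmers, k):
    """Share of the contig's distinct k-mers that also occur in the reference."""
    contig_kmers = kmers(contig, k)
    if not contig_kmers:
        return 0.0  # contig shorter than k: nothing to compare
    return len(contig_kmers & reference_kmers) / len(contig_kmers)


K = 4  # toy value; real tools use much longer k-mers
reference = "ACGTACGTTAGC"  # hypothetical stand-in for a reference genome
reference_kmers = kmers(reference, K)

# Hypothetical assembled contigs from one individual.
contigs = {
    "contig_a": "CGTACG",    # fully contained in the toy reference
    "contig_b": "GGGGCCCC",  # shares no k-mer with the toy reference
}

# Flag contigs with no support at all in the reference as candidate novel sequence.
candidates = [
    name
    for name, seq in contigs.items()
    if fraction_in_reference(seq, reference_kmers, K) == 0.0
]
print(candidates)
```

A real analysis would rely on an aligner and would also need to filter out contamination and sequencing artifacts, which this sketch ignores.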
The novel sequences were scattered across the genomes of Swedish individuals—in genes as well as in non-coding regions. Notably, the team found handfuls of them within human disease-causing genes, she says. “These are sequences that we don’t interrogate today because they are not in the human reference genome—so if they are somehow linked to disease, we wouldn’t know about it.”
The findings didn’t surprise Lindstrand: Previous studies of African and Icelandic populations have also discovered novel sequences not present in the reference genome. To understand the origin of the novel sequences found in Swedish DNA, Lindstrand’s team compared them with those in African and Icelandic genomes and found that many were shared between Swedish, African, and Icelandic DNA.
There were still some novel sequences that didn’t align to the other human populations, so the team looked for them in the chimpanzee genome. They found that 31 percent of the Swedish novel sequences were present in the chimpanzee genome but not in any of the other human genomes, suggesting that they are ancient.
Perhaps those sequences were lost in the human reference genome due to a technical artifact, suggests Peter Audano, a bioinformatician at the University of Washington who wasn’t involved in the study. Or, perhaps more likely, those ancestral sequences were deleted during human evolution in the lineages that gave rise to the reference genome and the other human populations, he says.
Toward an improved reference genome
Neither Gillis nor Audano is surprised by the findings. The human reference genome is stitched together from multiple individuals, but 70 percent of it is derived from a single person, Audano says. “That one person can’t represent all the diversity out there. There’s quite a bit of diversity that you’re just not going to find in any given individual,” he says.
Audano notes that the team used Illumina sequencing for its study, which isn’t the best method for resolving a genome in fine detail. It sequences only very short snippets of DNA at a time and is known to miss repetitive sequences and duplications. Long-read technologies, which sequence longer strands of DNA at a time, are necessary to bridge those regions, which is why the National Institutes of Health is funding a modernization of the human reference genome using long-read sequencing of 350 individuals. However, studies like Lindstrand’s that are based on short-read technologies are helpful in surveying genetic diversity across many individuals quickly and cost-effectively, he notes.
Lindstrand views constructing a new type of reference genome—a graph reference genome—as a good potential solution. This would use a normal reference genome as a backbone of a “graph” to which common genetic variants are added, in order to encompass as much variation as possible.
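To make the graph idea concrete, here is a minimal sketch of how a linear backbone plus one common variant might be stored as a graph. It is an illustration under stated assumptions, not the design of any existing graph genome tool: the class name `GraphGenome`, the node labels, and the DNA strings are all hypothetical.

```python
from collections import defaultdict


class GraphGenome:
    """Toy sequence graph: nodes hold DNA segments, edges say which segment may follow which."""

    def __init__(self):
        self.segments = {}             # node id -> DNA string
        self.edges = defaultdict(set)  # node id -> ids of nodes allowed to follow it

    def add_segment(self, node_id, sequence):
        self.segments[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].add(dst)

    def spell(self, path):
        """Return the sequence spelled by a path, checking that every step is a stored edge."""
        for src, dst in zip(path, path[1:]):
            if dst not in self.edges[src]:
                raise ValueError(f"no edge {src} -> {dst}")
        return "".join(self.segments[node] for node in path)


graph = GraphGenome()

# Backbone segments, as if cut from a linear reference (made-up bases).
graph.add_segment("ref1", "ACGT")
graph.add_segment("ref2", "TTGA")
# A hypothetical insertion carried by some individuals but absent from the backbone.
graph.add_segment("ins1", "GGC")

graph.add_edge("ref1", "ref2")  # the backbone path skips the insertion
graph.add_edge("ref1", "ins1")  # the variant path passes through it
graph.add_edge("ins1", "ref2")

reference_haplotype = graph.spell(["ref1", "ref2"])
variant_haplotype = graph.spell(["ref1", "ins1", "ref2"])
print(reference_haplotype, variant_haplotype)
```

In this picture, each person's genome corresponds to one path through the graph, so a sequence missing from the backbone can still be represented as an alternative branch rather than being left out.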
Gillis favors incremental improvement to the reference genome over drastic changes. “I am nervous about changing the reference too dramatically” because it will require so much change in methods and techniques used by downstream research communities that use the reference, he says. “Graph genome methods might be perfect if they worked perfectly, but that might be tricky to make happen.”
Regardless of how researchers decide to alter the reference genome, improvements will have many benefits to science, Lindstrand stresses. “By improving the reference, we will diagnose more patients and that will be very beneficial to the medical community when we move towards personalized medicine,” she adds.
J. Eisfeldt et al., “Discovery of novel sequences in 1,000 Swedish genomes,” Molecular Biology and Evolution, doi:10.1093/molbev/msz176, 2019.
Katarina Zimmer is a New York–based freelance journalist. Find her on Twitter @katarinazimmer.