Since the Human Genome Project was completed, scientists around the world have worked tirelessly to populate the sequence and variant databases that have become the crown jewels of genomics research. These databases are now brimming with genomic information, but unfortunately, they are greatly biased towards individuals of European descent. For example, 70 percent of the data stored in the Genome-wide Association Study (GWAS) Catalog, a publicly available resource that contains manually curated array-based data from more than 2,800 published studies, is from individuals of European descent. The other 30 percent comes from individuals with Asian ancestry. Similarly, the database of Genotypes and Phenotypes (dbGaP) and the Genome Aggregation Database (gnomAD) are lacking data from individuals hailing from the Middle East, Central Asia, Oceania, and Africa.
European countries such as Iceland, Estonia, and the UK were among the first to launch countrywide whole-genome sequencing efforts. Hence, it’s no surprise that these databases are skewed accordingly. What matters now is that we recognize this as a problem that needs to be addressed.
This is not just an academic issue. People whose lineages are well represented in genomic databases will benefit the most from precision medicine approaches that rely on genetic markers to diagnose, understand, and treat a patient’s disease. Individuals from underrepresented ethnicities are more likely to get inaccurate diagnoses from genetic tests, which may have negative consequences. For instance, the identification rate of pathogenic genetic variants for nonsyndromic hearing loss (NSHL) has consistently been lower in people of African ancestry than in people of European ancestry, because the lack of data limits clinical geneticists’ ability to diagnose this condition in patients of African ancestry. Similarly, researchers recently showed that a genetic test for cardiomyopathy yielded six false positive results among approximately 230 patients of African ancestry. The false positives arose because the variants in question were more common among healthy control individuals of African descent than among healthy control individuals of European descent.
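The reasoning behind the cardiomyopathy example can be made concrete with a small sketch. It is not the method used in the study described above. It only illustrates the general idea that a variant seen often in healthy people from some population is unlikely to cause a rare disease there, and that this signal is invisible when that population is missing from the reference data. The class and function names, the 1 percent cutoff, and all counts below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlCounts:
    """Healthy-control counts for one variant in one population (hypothetical data)."""
    carriers: int  # healthy controls who carry the variant
    total: int     # healthy controls who were genotyped

    @property
    def carrier_frequency(self) -> float:
        # Guard against an empty control group.
        return self.carriers / self.total if self.total else 0.0


def populations_where_too_common(controls_by_population, max_carrier_frequency=0.01):
    """Return the populations in which the variant is more common among healthy
    controls than the cutoff allows. The 0.01 default is an assumed,
    illustrative cutoff, not a clinical standard."""
    return sorted(
        population
        for population, counts in controls_by_population.items()
        if counts.total > 0 and counts.carrier_frequency > max_carrier_frequency
    )


# Hypothetical reference data containing European controls only:
# the variant looks rare, so nothing argues against calling it pathogenic.
european_only = {"European": ControlCounts(carriers=1, total=2000)}

# The same variant once hypothetical African controls are added:
# it is common among healthy people, which argues against pathogenicity.
diverse = {**european_only, "African": ControlCounts(carriers=30, total=1000)}

print(populations_where_too_common(european_only))  # []
print(populations_where_too_common(diverse))        # ['African']
```

With the European-only data, the check raises no objection and the variant could be reported as disease-causing in any patient. Adding controls from a second population is what exposes the variant as common in healthy individuals.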
Of course, the availability of sequencing data from diverse human populations depends on the regular deposition of DNA sequences from individuals of different populations in shared and open databases. However, there can be several impediments to open sharing of such data. First, while most DNA sequencing data obtained these days are from research programs enlisting properly consented individuals, it is estimated that by 2022, more than 80 percent of human genome sequence data will come from health-care systems. This will bring technical challenges in accessing data from disparate sources and regulatory challenges in protecting patients: anonymizing sequencing data helps protect patients, but when these data are accompanied by demographic and/or phenotypic data, anonymization is not 100 percent effective. Second, there are inconsistent policies surrounding genomic data sharing. The Fort Lauderdale Agreement supports free and unrestricted use of genomic sequencing data prior to scientific publication. On the other hand, some funding entities restrict genomic data sharing in the interests of national security.
The human reference genome represented a major achievement and, thanks to ongoing improvements, continues to be the foundation for much of biomedical research. Scientists built this reference genome with about two-thirds of the data coming from one individual’s genome (ironically, that of an African American) at a time when it was thought that human genetic diversity was much more limited.
We learned a tremendous amount from sequencing and assembling that first-ever human reference genome, and technical improvements in sequencing technology subsequently made it easier to determine the DNA sequences of many additional individuals. As we obtained more whole-genome sequence data, we realized that individual genomes differed much more than we initially appreciated. Indeed, the sequencing and de novo assembly of a Korean individual’s genome was used to fill more than 100 gaps in the human reference genome and identify many variants specific to people of Asian descent.
As more genomes from different populations are sequenced, population-specific reference genomes are being developed. For example, scientists in Japan have integrated three newly assembled genomes into a population reference with detailed analyses of genetic variants. These and other new genome assemblies will serve as excellent foundations on which to establish individualized medical treatments for those populations.
Diversifying types of genetic variants
Just as we must look at genomes broadly across populations, so too must we look broadly at variant types within and between world populations. Structural variants (SVs)—which include deletions, duplications, insertions, inversions, and translocations—are important for precision medicine, but our ability to identify classes of SVs has consistently lagged behind our ability to identify the easier-to-find single nucleotide polymorphisms (SNPs).
Fortunately, that’s starting to change. For example, my team previously worked with scientists on the 1000 Genomes Project. Its goal was to catalog 99 percent of the SNPs seen in at least 1 percent of 26 world populations (6 populations from the Americas—including 2 African-descendant populations—5 populations from Africa, 5 European populations, 5 South Asian populations, and 5 East Asian populations). In total, the project sequenced and analyzed the genomes of 2,504 individuals (about 100 genomes from each of the chosen populations) and identified more than 88 million variants, including 60,000 SVs.
As we further understand the extent of human genetic diversity, it will be critical that more researchers consider this variation when designing studies to test and model human diseases. Experimenting on cells from an individual of one ethnic background limits the conclusions to individuals of that population. Establishing a standardized and well-characterized panel of induced pluripotent stem cell lines, made from individuals representing myriad populations, could be an excellent resource for recapitulating human genetic diversity in future experimental work. The use of differentiated cells from genetically diverse individuals could help us determine which populations a particular conclusion applies to.
Similarly, when conducting in vivo experiments, researchers should use animals with different genetic backgrounds to more fully represent genetically diverse humans, their complex traits, and their susceptibility to diseases. In mice, the “collaborative cross” and the “diversity outbred” panels (developed in part by, and available from, the Jackson Laboratory) are being used to show a wide variety of phenotypic outcomes for a given mutation depending on genetic background, and can be a powerful platform for identifying mutation modifiers.
The lack of diversity in our databases and genomic studies should be considered a problem for the entire life-science community, and together we must strive to address it—not just by sequencing more people of different backgrounds, but by taking such variation into account in all our future research. I urge scientists to face the challenge and consider how they can contribute to a more comprehensive solution that will prevent persistent inequality in the availability of individualized medicine.
Charles Lee is director of the Jackson Laboratory for Genomic Medicine in Farmington, Connecticut. He is also a distinguished professor at Ewha Womans University in South Korea as well as at the First Affiliated Hospital of the Xi’an Jiaotong University in China. Lee is currently president of the Human Genome Organization.