An international team led by researchers at the Broad Institute of MIT and Harvard has compiled and analyzed the largest aggregate collection of human protein-coding sequences to date. The researchers, members of the Exome Aggregation Consortium (ExAC), have made these raw data openly accessible to the research community since 2014. In its latest analysis of exomes from around the world, presented in part at a genomics conference in 2015, the team highlighted the utility of the large dataset for identifying rare disease–causing variants and genes that are particularly sensitive to mutational variation, including loss of function. The results are published today (August 17) in Nature.
“The important part of the work is the large number of [exomes],” Stephen Scherer, who studies variation in the human genome at the Hospital for Sick Children and the University of Toronto, Canada, but was not involved in the work,...
“This is the deepest anyone has gone for any substantial part of the [human] genome,” said Jay Shendure of the University of Washington in Seattle, who penned an accompanying perspective but was not involved in the research.
The protein-coding sequences—which comprise less than 2 percent of the entire human genome—“are the parts of the genome we understand the best and they are also the regions of genome where the vast majority of severe disease causing mutations are found,” study coauthor Daniel MacArthur of the Broad Institute told reporters during a press briefing.
MacArthur and colleagues pooled exome data contributed by researchers from more than two dozen disease-specific projects, creating a list of more than 7.4 million genetic variants from 60,706 individuals, a dataset 10-fold larger than any prior exome database. The information took up almost a petabyte of storage (an aggregate of 4,000 laptops' worth of raw data, according to MacArthur).
“Many of these projects were directly studying common human diseases but had variable success which points to the fact that data can have uses besides its intended purpose,” Shendure told The Scientist.
The team, led by Monkol Lek, a research fellow in the MacArthur lab, found variants spaced around every eight base pairs, on average, within regions of the genome that are particularly prone to variation. The researchers often captured the same variant over and over, suggesting that the dataset is large enough that variants within these regions were becoming saturated. While the dataset is not large enough to capture every possible genetic variant, the team was able to capture about 63 percent of all possible synonymous variants at these particular sites. “I find that exciting, as it previews what is in store in the long future trajectory of this field as we sequence millions of human genomes,” said Shendure.
The large number of exomes allowed the researchers to find that 183 of 192 allelic variants previously categorized as pathogenic (but found at a relatively high frequency in the ExAC database) are likely benign.
The team also identified 3,230 genes that are particularly intolerant to mutation even when the second copy of the gene is wild-type. Seventy-two percent of these genes have not been linked to any known disease, demonstrating the ability of data from apparently healthy individuals to reveal genes that—when mutated—may contribute to disease.
“ExAC has taken exome sequenced samples from several different projects of control and common disease cohorts but, by reanalyzing and recalling variants of the entire dataset together, it has produced consistent and accurate frequencies of rare variation,” Roddy Walsh, who studies the genetics of inherited heart disease at Imperial College London and has used ExAC data, wrote in an email to The Scientist.
Separately, Walsh and colleagues from Imperial College London and the University of Oxford, U.K., have demonstrated the utility of the amassed genetic information for evaluating genes involved in multigenic heritable diseases. The team compared ExAC data to 7,855 clinical cardiomyopathy cases, finding that many putative cardiomyopathy-linked genes are unlikely to contribute to the disease. “By focusing on confirmed genes, we expect this study to improve the clinical genetic testing of cardiomyopathies by reducing the number of uncertain and even false positive results,” wrote Walsh. The team’s results appeared in Genetics in Medicine today.
In a third publication, Douglas Ruderfer from the Icahn School of Medicine at Mount Sinai in New York City and colleagues analyzed the patterns and frequencies of copy-number variants (CNVs) found within ExAC data. CNVs, gains or losses of sequences at least 1,000 base pairs in length, are rarer and more difficult to detect than single-nucleotide changes or indels (insertions or deletions). About 70 percent of the exomes contained at least one rare CNV, the researchers reported in Nature Genetics today; individual exomes, on average, contained 0.81 deleted and 1.75 duplicated genes.
The ExAC dataset currently represents individuals of African/African-American, Latino, East Asian, European, and Southeast Asian descent. For MacArthur, one future goal is to fill gaps in the underrepresented populations, including those from the Middle East and parts of Africa.
“The project also aims to release a new exome data set with approximately twice the number of individuals at this year’s American Society of Human Genetics meeting in October,” Lek wrote in an email to The Scientist.
M. Lek et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, doi:10.1038/nature19057, 2016.
D.M. Ruderfer et al., “Patterns of genic intolerance of rare copy number variation in 59,898 human exomes,” Nature Genetics, doi:10.1038/ng.3638, 2016.
R. Walsh et al., “Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples,” Genetics in Medicine, doi:10.1038/gim.2016.90, 2016.