An important goal of personalized medicine is to be able to create individual lifetime health plans that are tailored to each person’s unique genetic makeup. In recent years, technological advances have provided the tools to make widespread, affordable whole-genome sequencing possible—but in doing so have revealed just how unique those individual genomes can be. Two recent studies involving the deep sequencing of human exomes and drug target genes in more than 16,000 individuals, for example, clearly demonstrated that rare variants—those that occur in less than 0.5 percent of the population—are, as a group, quite abundant and have functional impact (Science, 337:64-69, 2012; 337:100-104, 2012). The studies showed that each person carries more than 10,000 variants—one in every 17 bases—of which at least 300 can be predicted to affect protein function. And in late October, the 1000 Genomes Project published a comprehensive map of human genetic variation based on the whole genome sequences of 1,092 volunteers from Europe, East Asia, Africa, and the Americas (Nature, 491:56-65, 2012). These data further demonstrate the importance of rare variation in health and disease by showing that common variants are shared across populations, while rare variants are unique to specific ethnic groups.
Our new appreciation for this uniqueness has opened our eyes to the shortcomings of the tools and methods that are commonly used for measuring and interpreting genetic variation. Indeed, many of the studies that were done over the past decade to identify and measure the effects of genetic change were carried out using tools that were created with the assumption that genetic differences are rare. The most common tools are all based on a single human reference genome sequence that was put together nearly 10 years ago. Although several improvements have since been made, this single sequence, which fails to capture the nuanced variation of “normal,” is still used as the standard reference to which all other sequences are compared.
The vast majority of gene expression and genome-wide association studies (GWAS), for example, are based on microarray assays that were designed using this single human reference sequence. To comb for associations between individual genetic changes and observable traits, GWAS use DNA probes, short pieces of DNA attached to glass slides, that are based on the reference sequence. Because each position is commonly assumed to be biallelic, with only two possible bases, the assay typically uses just two probes to measure the genotype. If an individual has a different base at the measured position, different bases nearby, or the probe is in a region of local duplication, that person’s genotype cannot be measured and is reported as inconclusive. Indeed, earlier this year one of us (Smith) and colleagues showed that half of sequences assayed by these microarray probes contain previously unknown variation or lie within structural variants, resulting in “missing” genotypes. These missing genotypes contribute in part to challenges in interpreting and reproducing GWAS results (PLOS ONE, 7:e40294, 2012).
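The two-probe logic described above can be sketched in a few lines of code. This is a hypothetical illustration, not the calling algorithm of any real microarray platform: the intensity values, threshold, and function name are invented for the example. It shows how a sample whose sequence disrupts both probes ends up as a no-call, the “missing genotype” problem discussed in the text.

```python
def call_genotype(intensity_a: float, intensity_b: float,
                  min_signal: float = 0.2) -> str:
    """Call a biallelic genotype from two probe intensities (alleles A and B).

    If neither probe hybridizes well -- e.g. the sample carries a third
    allele, or nearby variation disrupts probe binding -- the result is
    a no-call ("NN") rather than a genotype.
    """
    total = intensity_a + intensity_b
    if total < min_signal:
        return "NN"          # neither probe matched: unexpected variation
    ratio = intensity_a / total
    if ratio > 0.8:
        return "AA"          # homozygous for allele A
    if ratio < 0.2:
        return "BB"          # homozygous for allele B
    return "AB"              # heterozygous

print(call_genotype(1.0, 0.05))   # AA
print(call_genotype(0.5, 0.55))   # AB
print(call_genotype(0.02, 0.03))  # NN: weak signal, e.g. a rare variant under the probe
```

Note that the no-call branch is silent about *why* the probes failed, which is exactly why such genotypes simply go missing from downstream analyses.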
Microarray assays are not the only measurement tool that suffers from an overly simplistic view of genetic variation. Another example is transcriptome analyses, in which sequences are determined for all the mRNAs in a cell. Each partial mRNA sequence is identified by comparison to the human reference sequence. Similar to microarrays, if the sequence in the sample is too different from the reference, or aligns to a repetitive sequence, it cannot be unambiguously identified. As more data are collected, the limitations of a single reference are becoming clear.
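The two failure modes just described, a read too divergent from the reference and a read matching a repetitive region, can be illustrated with a toy matcher. This is not a real aligner; the reference string, read sequences, and mismatch threshold are invented for the example.

```python
def align_read(read: str, reference: str, max_mismatches: int = 1):
    """Slide the read along the reference and collect near-exact hits.

    Returns a single position when exactly one good hit exists; otherwise
    None -- either no hit (the read diverges too much from the reference)
    or several hits (a repetitive region), both of which leave the read
    unidentified.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits[0] if len(hits) == 1 else None

ref = "ACGTACGTTTGACGTACGT"          # note the repeated ACGTACGT motif
print(align_read("TTGAC", ref))      # 8: a unique hit
print(align_read("ACGTAC", ref))     # None: two equally good hits in the repeat
print(align_read("TTCCC", ref))      # None: too divergent from the reference
```

A single reference sequence makes both outcomes more likely: reads carrying rare variants rack up mismatches, and nothing in the reference tells the aligner which copy of a repeat the sample actually came from.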
Additional problems also appear in bioinformatics analyses informed by a limited view of variation. We observed that a large number of variants reported for common DNA samples in both the 1000 Genomes and Complete Genomics data sets were unique to one data set or the other. This finding likely resulted from differences in data-collection and data-processing approaches. And these issues go beyond problems with multi-allelic genotypes and single nucleotide changes. Bioinformatics studies also rely on measures of linkage disequilibrium (LD)—the likelihood that two alleles will be inherited together, given their proximity in the genome. If these estimates of LD are off, then the genotypes predicted from studies with small numbers of samples are unreliable.
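The LD statistic mentioned above is typically quantified as r², computed from haplotype frequencies. The sketch below shows the standard calculation; the haplotype data are invented solely to illustrate it, and real pipelines handle monomorphic loci and unphased genotypes that this toy version does not.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic loci from phased haplotypes.

    Each haplotype is a two-letter string, e.g. "AB" means allele A at
    locus 1 and allele B at locus 2 on the same chromosome. Assumes both
    loci are polymorphic (nonzero denominator).
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # allele frequency, locus 1
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # allele frequency, locus 2
    p_ab = haplotypes.count("AB") / n                # joint haplotype frequency
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Perfectly coupled alleles: knowing one locus predicts the other.
print(ld_r_squared(["AB", "AB", "ab", "ab"]))   # 1.0
# Independent alleles: no predictive power.
print(ld_r_squared(["AB", "Ab", "aB", "ab"]))   # 0.0
```

Because r² is estimated from sampled haplotypes, estimates drawn from small or unrepresentative panels can be badly off, which is what undermines the genotype predictions the text describes.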
Because we have for so long underestimated individual variation in the human genome, it is clear that many of the conclusions from previous studies need to be revisited. The challenge ahead is how to use this new understanding to inform new types of DNA sequencing-based analyses. Achieving the long-term goals of personalized medicine will require better reference resources and standardized methods for using them.
Choosing standards that reflect diversity
Our immediate task is to develop new standards for large-scale DNA sequencing. For example, the National Institute of Standards and Technology (NIST) has initiated efforts to standardize DNA samples to compare next-generation sequencing (NGS) techniques and other genotyping assays. Above all, these standards must include the development of better reference tools. Should there be multiple reference sequences? Should each individual’s genome be sequenced and used as a personal reference? To serve this goal and help answer these questions, the Genome Reference Consortium, a multi-institutional collaboration, continues to fill in gaps and develop alternative sequences in regions of high variability. In other efforts, the US Food and Drug Administration–led SEquencing Quality Control Consortium (SEQC) is working to standardize assays that depend on these reference materials. Specifically, SEQC focuses on NGS-based assays that measure gene expression, where DNA sequencing is used to count the number of RNA molecules transcribed from a given gene.
Unfortunately, developing robust standards is not the highest priority for the National Institutes of Health (NIH). Instead, the NIH and other funding agencies continue to invest in high-profile projects that emphasize data collection over infrastructure development. The ENCODE project is one example. An enormous quantity of data was produced from a limited number of human cell lines that have not had their genomes sequenced. All functional annotations are thus inferred from a reference genome that is very different from the genomes of the immortalized cell lines that were used. While such projects build insights and discover new biology, they still leave us with tools that fail to take into account the incredible diversity of human genomes.
For personalized medicine, a diagnosis based on an incomplete understanding of genomic variability can result in the use of drugs that lack efficacy. Missing rare but critical patterns of variation can also lead to undiagnosed rare conditions, with a lifetime of cost and pain for affected individuals. Improving both situations requires better and more complete ways to measure individual variation. Without a stronger genomics foundation, having one’s genome read will fail to accurately inform a person’s lifetime health plans.
Todd Smith is a senior leader of research and applications at PerkinElmer in Seattle, Washington. Sandra Porter is the president of Digital World Biology, also in Seattle, where she develops educational materials that use bioinformatics to teach biology.