Phase 1 of the International HapMap Project (http://www.hapmap.org), published in November 2005, was hailed by the mainstream press as a revolutionary tool for gene-association studies. Researchers using the data have been similarly enthusiastic. Says Jeanette McCarthy of the Graduate School of Public Health, San Diego State University, "It's an unprecedented resource. ... It provides a lot more information not just for people doing whole genome association studies, but [also] for those focused on specific regions of the genome or even candidate genes. It can add a lot of information and help us pinpoint the genes a lot easier."
Left out of the discussion, however, are more practical issues. Like any map, the HapMap requires some training to use properly. How, for instance, do you use the data? Are there things to look out for when choosing SNPs (single nucleotide polymorphisms) and determining haplotype block...
1. DON'T USE HAPMAP FOR RARE VARIANTS
At the core of the HapMap is the concept of linkage disequilibrium (LD), a numerical measure of how well a SNP in a particular location correlates with a SNP in another location nearby; high LD between two SNPs means that the presence of an allele at one SNP is a good predictor of the other SNP's allele. So-called htSNPs (haplotype-tagging SNPs) are SNPs in high LD with every other polymorphism located in a haplotype region or block; in other words, an htSNP can be used as a surrogate marker for every SNP within that particular block. Others prefer tagSNPs, which can be determined using pairwise LD (e.g., r² ≥ 0.8), without reference to blocks.
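To make the pairwise measure concrete, here is a minimal Python sketch of how r² between two biallelic SNPs can be computed from phased haplotypes. The function name, the 0/1 allele coding, and the toy data are assumptions made for illustration; they are not taken from the HapMap tools themselves.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per phased
    chromosome (hypothetical coding: 0 = one allele, 1 = the other).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP has no defined r^2; treated as no LD here
    d = p_ab - p_a * p_b                      # the disequilibrium coefficient D
    return d * d / denom


# Toy data: eight phased chromosomes typed at three SNPs.
snp1 = [1, 1, 1, 1, 0, 0, 0, 0]
snp2 = [1, 1, 1, 0, 0, 0, 0, 0]   # differs from snp1 on one chromosome
snp3 = [1, 0, 1, 0, 1, 0, 1, 0]   # unrelated to snp1

for name, other in (("snp2", snp2), ("snp3", snp3)):
    r2 = r_squared(snp1, other)
    print(name, round(r2, 3), "tags snp1" if r2 >= 0.8 else "does not tag snp1")
```

In this toy example a single discordant chromosome out of eight is enough to pull r² below the 0.8 cutoff, which shows how strict that threshold is in a small sample.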
In gene-association studies, users generally pick five or six SNPs with a high degree of LD for a known variation, making the assumption that if these SNPs tag well for that variation, they will also tag well for another, unknown variation associated with it. Most of the time this will be a good assumption, but not always, says Oxford scientist Gilean McVean. One case involves rare variation, though McVean says more sophisticated algorithms, yet to be developed, might be able to tag these. "The second is that some SNPs are simply untaggable," he says, "and it's not clear why that's true. They may be hot spots of mutation ..., they may be hotspots of gene conversion, or they might be sitting in the middle of recombination hot spots. And if any of those three is true, then the current HapMap won't tag those."
That's the conclusion Oxford researcher Eleftheria Zeggini and colleagues came to in a recent evaluation of HapMap performance,
2. NOT ALL tagSNPs ARE CREATED EQUAL
Several tools (see Table) are available to help you pick HapMap SNPs, but not all of them will give you the same answer.
"Really, all they're doing is going through the HapMap data, computing statistical correlations between the different SNPs," says Lon Cardon of the Wellcome Trust Center for Human Genetics. "Whenever they see a high correlation, they pick one of them, and say ... you just need to genotype this one out of this set. But naturally, there's multiple solutions to that problem." Adds Michael Nothnagel of the University of Kiel, Germany (whose own work demonstrated the effect of SNP marker choice on haplotype block patterns
Nothnagel adds that further verification of the LD pattern is also needed. "You can have identical patterns with different SNP sets ... but you cannot assume that this is always the case," he says, noting that in his view, the pairwise method of selecting SNPs is more stable than the block approach.
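Cardon's point about multiple solutions can be illustrated with a deliberately simplified greedy selection, sketched here in Python. This is not the algorithm of any particular tool; the function names, the SNP labels, and the toy haplotypes are all hypothetical. The only thing that changes between the two runs is the order in which SNPs are visited.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two SNPs given 0/1 alleles per phased chromosome."""
    n = len(hap_a)
    p_a, p_b = sum(hap_a) / n, sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def greedy_tags(haplotypes, order, threshold=0.8):
    """Visit SNPs in the given order; a SNP becomes a tag unless an
    already-chosen tag captures it at r^2 >= threshold."""
    tags = []
    for snp in order:
        captured = any(
            r_squared(haplotypes[snp], haplotypes[tag]) >= threshold
            for tag in tags
        )
        if not captured:
            tags.append(snp)
    return tags


# Hypothetical data: snpA and snpB are perfectly correlated, snpC is independent.
haplotypes = {
    "snpA": [1, 1, 1, 1, 0, 0, 0, 0],
    "snpB": [1, 1, 1, 1, 0, 0, 0, 0],
    "snpC": [1, 0, 1, 0, 1, 0, 1, 0],
}

set_one = greedy_tags(haplotypes, ["snpA", "snpB", "snpC"])
set_two = greedy_tags(haplotypes, ["snpB", "snpA", "snpC"])
print(set_one)  # snpA stands in for snpB
print(set_two)  # snpB stands in for snpA
```

Both tag sets capture all three SNPs at the chosen threshold, yet they share only one marker. Two studies that relied on these two sets would have genotyped different SNPs, which is why comparing results marker by marker matters.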
Cardon suggests that comparisons between two studies using different tags be done at the marker level. "If I wanted to see how my answers compared with someone else's, I [would] genotype the same markers that they did, not necessarily just proxies," he says.
3. DON'T DISCARD YOUR FAILED DATA
Scientists can use the HapMap to identify large insertion/deletion polymorphisms by mining apparent genotyping failures (such as null genotypes or Mendelian inheritance inconsistencies) for patterns that are indicative of these chromosomal abnormalities.
Until now, deletions have been hard to detect with standard SNP genotyping assays. With the HapMap, though, one can scan the genotyping data for unusual regions that match the appearance of a deletion.
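As a rough illustration of this kind of scan, the Python sketch below checks parent-child trios for Mendelian inconsistencies and flags runs of consecutive SNPs where calls failed or inheritance looks impossible. The genotype coding (count of one allele: 0, 1, or 2, with None for a failed call), the function names, and the run-length cutoff of three are assumptions chosen for the example; a real analysis would need a proper statistical model and follow-up validation.

```python
def mendel_consistent(father, mother, child):
    """True if the child's genotype can be built from one allele per parent.
    Genotypes are allele counts: 0, 1, or 2."""
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}   # alleles each genotype can transmit
    return any(f + m == child
               for f in gametes[father]
               for m in gametes[mother])


def flag_runs(trio_calls, min_run=3):
    """trio_calls: list of (position, father, mother, child), sorted by position.
    Returns (start, end) positions of runs of at least min_run consecutive
    SNPs that are either failed calls or Mendelian inconsistencies."""
    runs, current = [], []
    for pos, f, m, c in trio_calls:
        suspicious = None in (f, m, c) or not mendel_consistent(f, m, c)
        if suspicious:
            current.append(pos)
        else:
            if len(current) >= min_run:
                runs.append((current[0], current[-1]))
            current = []
    if len(current) >= min_run:             # close a run that reaches the end
        runs.append((current[0], current[-1]))
    return runs


# Hypothetical trio genotypes at seven SNP positions.
calls = [
    (100, 0, 1, 1),      # consistent
    (200, 0, 2, 2),      # inconsistent: father cannot transmit allele 1
    (300, 2, 0, 0),      # inconsistent: father cannot transmit allele 0
    (400, 1, 1, None),   # failed call in the child
    (500, 1, 1, 2),      # consistent
    (600, 0, 0, 1),      # isolated inconsistency, more likely plain error
    (700, 0, 0, 0),      # consistent
]

candidate_regions = flag_runs(calls)
print(candidate_regions)  # [(200, 400)]
```

The isolated inconsistency at position 600 is ignored, while the clustered failures between 200 and 400 are reported as one candidate region. That clustering is the pattern worth a closer look, since a hemizygous deletion can make a heterozygous carrier appear homozygous across several neighboring SNPs.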
4. DON'T ASSUME HAPLOTYPE BLOCKS HAVE FIXED ENDS
Yale researcher Josephine Hoh and colleagues recently performed a genome-wide screen for SNPs associated with age-related macular degeneration (AMD) in 146 Caucasian subjects and identified a 500-kb region containing two alleles associated with the disease.
Hoh's team assumed the block would contain functional polymorphisms in linkage disequilibrium with the risk alleles. But after resequencing the exons, they found that a functional polymorphism associated with AMD was in fact located 2 kilobases upstream, just outside the HapMap block.
One lesson, says Hoh, is this: Researchers who use HapMap data to fine-tune their disease-association studies must rely on educated judgment when determining block boundaries. "The question will be: Are you going to sequence in just that block of intervals, or do you want to widen the interval a little bit and do a larger size of sequencing?" she says. "After all, the cohorts we are investigating would never be the same as the ones in the HapMap project."
5. VALIDATE, VALIDATE, VALIDATE
Keith Cheng and colleagues at the Pennsylvania State University College of Medicine, along with Penn State anthropologist Mark Shriver, used the HapMap recently in a study of the genetic basis of human pigment variation.
Cheng points out that his team performed three levels of validation: frequency difference between populations, regional evidence of selection, and functional evidence. Although the genomic data clearly provided evidence of selection, Cheng's team performed a second study in two admixed populations (with measured skin pigmentation levels) not included in the HapMap data to confirm their findings. "When you do these sorts of studies, you need to do multiple levels of validation in order to prove that what you assert is correct. I think it's very dangerous to start with any one feature alone, such as just frequency or just population distribution around the world. There can be a lot of artifacts caused by things like bottlenecks," Cheng says.