A decade ago, scientists studying the human genome found 1,447 copy number variable regions, covering a whopping 12 percent of the genome (Nature, 444:444-54, 2006). Ranging in size from 1 kilobase to many megabases, repetitive DNA sequences scattered throughout the human genome can expand and contract in number like an accordion as cells divide. Extra, or too few, copies of these repeats, known as copy number variations (CNVs), can explain inherited diseases or, when the copy number change occurs sporadically in somatic cells, can result in cancer. Today, a growing number of scientists are making links between CNVs, health, and disease.
But measuring CNVs in cells from an individual can be tricky. For a number of years, researchers relied on fluorescently tagged microarray probes that attached to sections of genes; locations where the probes fluoresced more or less brightly than in an average genome suggested...
“This seems to be the next wave in CNV calling,” says computational biologist Dan Levy of Cold Spring Harbor Laboratory. “Things are moving from microarray to sequencing-based approaches.”
But sifting through sequence data to find copy number changes is no easy task either, whether you’re starting from a whole genome or an exome.
“Exome sequencing was optimized to detect small things, like single-nucleotide variants or deletions, not for CNVs,” says Yufeng Shen, a computational biologist at Columbia University. “So the data is quite noisy for detecting CNVs.”
To sort through this noise, researchers are developing new computational tools, each of which takes a slightly different approach to finding CNVs. The Scientist spoke with the creators of four recently debuted open-source tools about what distinguishes their approaches and when you should consider using their software.
CANOES: FINDING RARE, SMALL CNV DELETIONS
THE PROBLEM: Shen was studying the genetics of heart disease when he decided that the existing tools weren’t working for him.
He wanted to detect small, rare copy deletions in CNV regions that might be related to heart-disease predisposition. Using whole-genome sequencing was expensive, but he didn’t like the messiness of the data that came from cheaper exome sequences.
“Exome sequences are biased toward a reference genome,” he points out. “You’re creating probes based on what you already know you want to sequence.”
Many methods that analyze exome data to detect CNVs rely solely on read depth: essentially, the number of times a section of the genome is read during sequencing. In collecting sequence data, more reads means a more accurate sequence. But in CNV regions, that number of reads, or depth, is also correlated with the number of repeats. More reads usually means an expanded CNV region with extra copies, while a lower depth of coverage than usual can indicate deleted repeats in the CNV region. Shen, though, thought that the inherent bias and noisiness in exome sequences wasn’t being considered appropriately, at a statistical level, by many of the CNV calling programs. So he created CANOES.
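The read-depth idea described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: the depth values, the region names, and the 0.4 threshold are all made up for the example.

```python
import math

def log2_ratio(sample_depth, reference_depth):
    """Log2 ratio of a region's read depth in a sample versus a reference."""
    return math.log2(sample_depth / reference_depth)

def classify_region(sample_depth, reference_depth, threshold=0.4):
    """Label a region from its depth ratio (threshold is an illustrative assumption)."""
    ratio = log2_ratio(sample_depth, reference_depth)
    if ratio >= threshold:
        return "possible duplication"
    if ratio <= -threshold:
        return "possible deletion"
    return "no change detected"

# Hypothetical normalized read depths: (sample, reference)
regions = {"exon_A": (148, 100), "exon_B": (52, 100), "exon_C": (97, 100)}
calls = {name: classify_region(s, r) for name, (s, r) in regions.items()}

for name, call in calls.items():
    print(name, call)
```

A fixed cutoff like this ignores how noisy each region is, which is the weakness Shen describes.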
THE SOLUTION: Because sequencing is inherently variable, read depth will never be identical across every area of the exome. Many CNV analysis algorithms model the expected depth of coverage with a Gaussian bell curve. But Shen’s CANOES program instead relies on a negative binomial distribution, a curve that assumes more regions will have a low depth of coverage, even without changes to CNVs (Nucleic Acids Res, doi:10.1093/nar/gku345, 2014). Changing the curve may seem like a small modification, but Shen says it makes all the difference in being able to use messy exome data to find small changes of just a few repeats in CNV regions.
“We’re acknowledging that the data is very noisy and assuming the CNV of interest is relatively rare in the data set,” he says.
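To see how the two curves differ, here is a rough Python sketch. It is not CANOES code: the mean depth and the dispersion value are hypothetical, chosen only to show that a negative binomial is lopsided, with more than half of its probability sitting below the mean, whereas a symmetric bell curve splits evenly.

```python
import math

def neg_binomial_pmf(k, mean, dispersion):
    """P(depth == k) for a negative binomial with the given mean.

    Variance is mean + mean**2 / dispersion, so it always exceeds the mean.
    """
    p = dispersion / (dispersion + mean)
    log_pmf = (math.lgamma(k + dispersion) - math.lgamma(dispersion)
               - math.lgamma(k + 1)
               + dispersion * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

mean_depth = 100   # hypothetical average read depth
dispersion = 5.0   # hypothetical; smaller values model noisier data

# Sanity check: the probabilities should add up to (almost exactly) 1.
total_mass = sum(neg_binomial_pmf(k, mean_depth, dispersion) for k in range(5000))

# Share of regions expected to fall below the average depth.
mass_below_mean = sum(neg_binomial_pmf(k, mean_depth, dispersion)
                      for k in range(mean_depth))
gaussian_mass_below_mean = 0.5  # a symmetric bell curve splits evenly at its mean

print(f"negative binomial, share below mean: {mass_below_mean:.3f}")
print(f"Gaussian, share below mean: {gaussian_mass_below_mean:.3f}")
```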
The other key to CANOES is that if you have a particular exome of interest, such as one from a person with heart disease, that you’re comparing with a bunch of controls, it doesn’t automatically use all the controls. Instead, Shen finds reference exomes that have similar read depth to the exome of interest in most areas. That helps avoid false positives popping up simply because of the normal variation in reads.
“What we do is ID samples that, globally, are very similar to the sample of interest,” he says.
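One way to picture the reference-selection step is to rank control samples by how closely their read-depth profiles track the sample of interest. The sketch below uses Pearson correlation as the similarity measure, which is an assumption made for illustration; the sample names and depth values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length depth profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def pick_references(target, candidates, top_n=2):
    """Return the names of the top_n controls most similar to the target."""
    ranked = sorted(candidates,
                    key=lambda name: pearson(target, candidates[name]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical read depths across five exons
target = [100, 52, 210, 95, 160]
candidates = {
    "ctrl_1": [98, 55, 205, 99, 158],
    "ctrl_2": [60, 120, 90, 200, 70],
    "ctrl_3": [110, 50, 220, 90, 170],
    "ctrl_4": [150, 150, 140, 160, 155],
}

chosen = pick_references(target, candidates, top_n=2)
print(chosen)
```

Controls whose depth pattern looks nothing like the target's are left out, so their ordinary variation cannot masquerade as a copy number change.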
HOW AND WHEN TO USE: CANOES is available for free (www.columbia.edu/~ys2411/canoes), but Shen warns that it’s not for the beginner.
“In order to use it, you have to have a good understanding of the statistical problem of calling CNVs and also know that you’ll have to pick the right reference samples,” he says.
Since the program shines at detecting deletions, he suggests using it at the same time as other methods that might be better at picking up on duplications.
“I’d tell people to choose three tools that all use models so that they are complementary to each other,” he says.
CONSERTING: MOVING BEYOND READ DEPTH FOR CANCER CNVS
THE PROBLEM: In 2010, computational biologist Jinghui Zhang of St. Jude Children’s Research Hospital in Memphis, Tennessee, was collaborating with colleagues to study the genomes of pediatric cancers. Along with gene mutations, she wanted to find copy number variations that might be linked to some tumors.
“We knew there were other tools available, and we kept running the tools on our data and trying to compare what we got, but to our surprise, the matches between all the tools were very poor,” says Zhang. In retrospect, she says, the tools were developed to find germline CNVs, not somatic changes that arise in cancers. “Tumor analysis is quite different from germline copy number analysis,” she says. “The size of somatic CNVs is more variable, and they can be present in only some cells of a tumor, requiring greater sensitivity to detect.” So Zhang spearheaded her own approach.
THE SOLUTION: Copy Number Segmentation by Regression Tree in Next-Generation Sequencing (CONSERTING) uses more than just read depth to find CNVs (Nat Methods, 12:527-30, 2015). The algorithm also combs the genome for structural variations, such as breakpoints—sequence locations known to be especially sensitive to damage. If two sequences that are normally adjacent are farther apart than usual, that points toward increased copy numbers; if two areas normally spaced apart are now close, that hints at a deletion.
“The structural variation support may not be important if you’re looking at a long deletion,” says Zhang. “But some of these CNV changes are very small and focal, and you might not have statistical power with just read-depth changes.”
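The logic of pairing a spacing signal with read depth can be sketched roughly as follows. This is a toy model rather than the CONSERTING algorithm: the function names, labels, tolerance, and thresholds are all hypothetical.

```python
GAIN = "consistent with copy gain"
LOSS = "consistent with deletion"
NONE = "no structural signal"

def classify_spacing(reference_distance, observed_distance, tolerance=0.2):
    """Compare the spacing of two sequence landmarks with the reference genome."""
    if reference_distance <= 0:
        raise ValueError("reference_distance must be positive")
    change = (observed_distance - reference_distance) / reference_distance
    if change > tolerance:
        return GAIN   # landmarks pushed farther apart than expected
    if change < -tolerance:
        return LOSS   # landmarks pulled closer together than expected
    return NONE

def call_with_support(depth_log2_ratio, spacing_label, strong=0.5, weak=0.15):
    """Accept a strong depth shift alone, or a weak one backed by spacing evidence."""
    if abs(depth_log2_ratio) >= strong:
        return True
    expected_label = GAIN if depth_log2_ratio > 0 else LOSS
    return abs(depth_log2_ratio) >= weak and spacing_label == expected_label

# A small focal loss: the depth shift is weak, but the landmarks sit closer together.
label = classify_spacing(reference_distance=5000, observed_distance=1000)
print(label, call_with_support(-0.2, label))
```

With depth alone, the weak shift in this example would be dismissed; the spacing evidence is what tips it into a call.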
In cancer, she says, this is even truer than in germline cells, because tumor cells may harbor diverse mutations. And if only some of the cells in a tumor have a copy number variation, the read depth in that area of the genome may not be obviously different.
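A bit of arithmetic shows why a change carried by only some cells is hard to see. The sketch below assumes a simple mixture model (not taken from the CONSERTING paper) in which unaffected cells carry two copies of the region, and the fractions are hypothetical.

```python
import math

def expected_depth_ratio(fraction_altered, altered_copies, normal_copies=2):
    """Expected read depth relative to normal when only some cells carry the change."""
    mean_copies = (fraction_altered * altered_copies
                   + (1 - fraction_altered) * normal_copies)
    return mean_copies / normal_copies

# A one-copy loss present in every cell versus in one cell out of five
ratio_all_cells = expected_depth_ratio(1.0, altered_copies=1)
ratio_some_cells = expected_depth_ratio(0.2, altered_copies=1)

print(f"all cells:  ratio {ratio_all_cells:.2f}, log2 {math.log2(ratio_all_cells):.2f}")
print(f"20% cells:  ratio {ratio_some_cells:.2f}, log2 {math.log2(ratio_some_cells):.2f}")
```

Under these assumptions, a deletion that halves the read depth when every cell carries it produces only a 10 percent dip when one cell in five does, a shift that can easily hide in sequencing noise.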
HOW AND WHEN TO USE: CONSERTING was developed to compare cancer cells with healthy cells from the same patient. But Zhang says it also could be used in germline samples.
The program is available open-access, but “it’s not easy for most scientists to use,” Zhang says. “We’re looking into whether we can create a graphical interface so a regular bench scientist who doesn’t have bioinformatics expertise can just upload their data.”
SAVING MONEY BY SMASHING THE GENOME
THE PROBLEM: Dan Levy wanted an approach to detecting CNVs that would rival the accuracy of whole-genome sequencing without the cost. “One of the things we’re always trying to do is make things cheaper,” he says. “That allows much larger studies, or you can make clinical products that are more reasonably priced.”
Typically, Levy says, researchers try to drive down the price of genome sequencing by increasing the length of the reads. But when you’re looking for CNVs, long reads don’t help, because “what you’re looking for is an excess of [similar] reads coming from one area of the genome.”
Levy realized that all he needed from each read of the genome was enough information to tell whether it belonged in a region. That gave him an idea for how to lower the cost of finding CNVs.
THE SOLUTION: To detect CNVs using the Short Multiply Aggregated Sequence Homologies (SMASH) method, the genome is first mapped using read lengths of about 150 base pairs, the length that currently gives the lowest cost per base. Levy’s approach then creates random fragments from those reads, each only 35 to 40 base pairs long. Those fragments are joined together into lengths suitable for creating a library (upwards of 300 base pairs) and tagged. Then, the SMASH software uses that library to generate read counts. Because of the fragmentation, each original read is used more than once (Genome Res, 26:844-51, 2016).
“You use long reads to save money, but you use them efficiently,” says Levy. “Instead of finding two things from two reads, we fragment and mix up those reads to get four or five mappings.”
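The counting side of that idea can be illustrated with a toy Python example. It is not the SMASH software: the 37-base fragment length is simply a value inside the 35-to-40 range mentioned above, and the dictionary lookup is a hypothetical stand-in for a real alignment step.

```python
from collections import Counter

def split_into_fragments(read, fragment_length=37):
    """Cut a read into consecutive pieces, dropping any short leftover."""
    last_start = len(read) - fragment_length
    return [read[i:i + fragment_length]
            for i in range(0, last_start + 1, fragment_length)]

def count_bin_hits(reads, fragment_to_bin, fragment_length=37):
    """Tally how many fragments land in each genomic bin."""
    counts = Counter()
    for read in reads:
        for fragment in split_into_fragments(read, fragment_length):
            bin_id = fragment_to_bin.get(fragment)  # stand-in for a real aligner
            if bin_id is not None:
                counts[bin_id] += 1
    return counts

# Toy sequences standing in for genomic fragments from different places
frag_a, frag_c, frag_g = "A" * 37, "C" * 37, "G" * 37
toy_index = {frag_a: "bin_1", frag_c: "bin_2", frag_g: "bin_2"}

# One 150-base read stitched together from four short fragments
read = frag_a + frag_c + frag_g + frag_a + "TT"
fragments = split_into_fragments(read)
counts = count_bin_hits([read], toy_index)

print(len(read), len(fragments), dict(counts))
```

In this toy case, one 150-base read yields four countable mappings instead of one.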
HOW AND WHEN TO USE: Levy says SMASH should work for most applications—germline or somatic CNVs. “The hope is to get some of these CNV tools into clinical applications,” he says. Cheap CNV analysis could be useful for both tumor profiling and prenatal screening, he points out.
The approach is relatively easy for other labs to pick up on, Levy adds. “It’s all using stuff that’s widely available.”
CANVAS: ALL-IN-ONE TOOL
THE PROBLEM: Eric Roller, a bioinformatics scientist at Illumina, wanted a CNV tool that could help Illumina’s broad customer base—researchers who study everything from inherited diseases to cancer using a wide range of data. No existing approaches, he says, met this one-size-fits-all requirement.
“CNVs can vary dramatically in size and copy number due to the different underlying biological mechanisms involved,” says Roller. “Extensive CNV heterogeneity means that many tools were designed to detect only particular kinds of variants,” he adds. Many existing tools can either look for CNVs in the germline or in somatic cells but not both, or only use whole-genome data or exome data but not both. And as users have admitted, these customized workflows can be daunting for a beginner, requiring manual inputs of certain control data values and tweaks to the algorithms for each project.
So Roller and his colleagues at Illumina took on these challenges. The result: a program called Canvas.
THE SOLUTION: While the software relies on the relatively common approach of measuring read coverage and allele frequency, where it stands apart is the easy-to-use interface and the machine-learning algorithms that automatically compute things such as coverage bins (collections of similar DNA fragments) so researchers don’t have to do manual calculations. “An analysis can be started with a single command,” says Roller.
To test Canvas’s capabilities, Roller and his colleagues searched for CNVs in a range of data sets, including breast cancer cell lines and previously sequenced genomes. Canvas came out with an 87 percent accuracy score and a 77 percent precision rating when used to analyze somatic cell lines—better in both ways than three older free tools the researchers compared it with. For germline and exome sequences, the program also boasted high accuracy and precision ratings. Moreover, it gave results 2.5 times faster than the next fastest method. The results were reported in Bioinformatics in March 2016 (doi:10.1093/bioinformatics/btw163).
HOW AND WHEN TO USE: “Canvas is a single software tool that can analyze both targeted and whole genome sequencing data from both tumor and germline samples,” says Roller.
The free, open-source software, available from GitHub, runs on Linux or Windows. Roller says he especially recommends it for comparing tumor cells with somatic cells. “This is a case where the data makes copy number calling challenging, due to noisy signal and potentially extensive genome rearrangements,” he points out. “It’s also an important application, as CNVs can be key markers or drivers of cancer.”