K. YUAN ET AL., GENOME BIOL, 16:36, 2015 By the time a person arrives at the doctor’s office with a tumor, a lot has already happened at the cellular—and genomic—level. That cancer sprang from one mutant cell that spawned a mass of cells with additional nucleotide changes and diverse phenotypes.
To understand the evolutionary history of a cancer, scientists are turning to single-cell sequencing. Conceptually, building a tumor’s family tree is fairly simple: “Cells that have mutation A come before cells that have mutation A and mutation B,” explains Aaron Diaz, a glioblastoma researcher at the University of California,...
Each tumor will yield its own unique tree shape, notes Nicholas Navin, a genomicist at the MD Anderson Cancer Center in Houston. A long trunk with short branches, for example, would indicate the cells underwent genetic changes early on, then stabilized, yielding a fairly homogeneous tumor. In contrast, a short trunk with plenty of branching would suggest the tumor diversified continually after the cells first became cancerous.
The method of constructing tumor trees is akin to evolutionary biologists’ strategies for constructing phylogenetic trees, and indeed, cancer biologists have borrowed computational tools from that field. However, single-cell sequencing brings its own special considerations. The major issue is that plenty of errors appear in single-cell data due to problems in amplifying or sequencing the individual genomes. Sequencing mistakes create false positives that look like a mutation, or obscure a true mutation. False-negative rates, in particular, can be high, sometimes topping 10 percent (Genome Biol, 17:86, 2016). In addition, some portions of a cell’s DNA might be overrepresented; in other cases, genotypes may be missing.
Scientists working with these noisy data need specialized tools. In recent years, cancer researchers, computer scientists, and mathematicians have joined forces to design a handful of new algorithms customized for tumor studies. Some produce a phylogenetic tree, with each leaf representing a sequenced cell. Others collapse cells with similar sequences into clones, generating a tree that represents how those clones evolved. Still other trees display the individual mutations in the order they likely occurred. It’s still early days, however, and there’s certainly room for improvement in these tools, cautions Sohrab Shah, a senior scientist at the British Columbia Cancer Agency in Vancouver.
One factor that differentiates such algorithms is whether they make what’s known as the infinite-sites assumption. This means they allow each genetic locus to mutate only once within a tumor, and no more, and never revert back to the wild-type sequence. While it seems like such multiple mutations should be unlikely, computational biologist Niko Beerenwinkel of ETH Zurich in Basel, Switzerland, recently posted a preprint to bioRxiv suggesting that recurrent mutations can and do occur, and not infrequently (bioRxiv, doi:10.1101/094722, 2016).
Here, The Scientist profiles four options to generate tumor trees based on single-nucleotide variations in single-cell sequences.
BITPHYLOGENY
Authors: Niko Beerenwinkel, Associate Professor, ETH Zurich, Basel, Switzerland; Florian Markowetz, Senior Group Leader, University of Cambridge, U.K.
Input: Users give the program a matrix, with cells in rows and mutations in columns. For each spot in the matrix, they indicate a ‘1’ if the mutation is present in that cell, a ‘0’ if not. Missing data points are also allowed.
Method: BitPhylogeny uses an algorithm called Markov Chain Monte Carlo to search for the tree that would best match the data in the matrix. At the same time, it groups cells with similar genotypes into clones. The program can infer that certain clones must have existed in the past, even if they aren’t in the tumor sample, because they are the likely last common ancestor of cells it’s analyzing. It also computes the length of the branches on the tree—that is, how much relative time passed between mutation events (Genome Biol, 16:36, 2015).
Error approach: As it’s developing possible trees, the program also estimates the error rate from the data matrix.
Output: BitPhylogeny builds a clonal tree, with each node being a group of genetically similar cells.
Pros
- In addition to single-nucleotide variants, BitPhylogeny can also analyze data on methylation of DNA, and in this case does not have to use the infinite-sites assumption, says Markowetz.
- It gives users information on the probability of individual branches occurring. Low-probability branches could have arisen from sequencing mistakes, says Beerenwinkel.
Cons
- For single-nucleotide variants, BitPhylogeny will use the infinite-sites assumption.
- Markov Chain Monte Carlo methods are slow; BitPhylogeny takes hours to generate a tree, depending on the complexity of data fed in.
- It does not distinguish between cells that are hetero- or homozygous for a given mutation.
SCITE
K. JAHN ET AL., GENOME BIOL, 17:86, 2016 Author: Niko Beerenwinkel, Associate Professor, ETH Zurich, Basel, Switzerland
Input: As with BitPhylogeny, users provide a matrix of single cells and mutations.
Method: BitPhylogeny was inefficient, so Beerenwinkel and colleagues designed SCITE to be faster and more robust. It also uses Markov Chain Monte Carlo methods to search through possible trees for the one most likely to result from the input data, though compared to BitPhylogeny, SCITE offers fewer parameters users can tweak (Genome Biol, 17:86, 2016).
Error approach: SCITE estimates the error rate from the data as it creates the tree, assuming the rate of false positives and false negatives are the same for all mutations.
Output: The program can create a phylogenetic tree, with each leaf representing a cell, or a mutation tree that lays out the order in which mutations occurred. It can decide which to produce based on the data input. For a large number of cells—say, hundreds—it’s more efficient to make a mutation tree, explains Beerenwinkel; a phylogenetic tree is more efficient if the program has to deal with hundreds of mutations. Whether a given data set contains high numbers of mutations depends on the mutation rate in the tumor itself as well as which technology was used to obtain those sequences, says Beerenwinkel. Users can specify which kind of tree they’d prefer.
Pro
- SCITE is efficient. So far, the researchers find it takes only a few minutes to generate a tree from 60–100 cells, though the timing will depend on the complexity of each data set.
Cons
- The original SCITE algorithm makes the infinite-sites assumption, so it will be inaccurate if the same mutation happened independently in multiple cells within a tumor, and it assumes that any homozygosity for a mutation is due to a technical error. But in their recent bioRxiv paper on multiple mutation rates, Beerenwinkel and colleagues described an extension to SCITE that would allow for recurrent mutation.
- SCITE does not take into account mutations that are present in all cells or only in one cell.
OncoNEM
Authors: Florian Markowetz, Senior Group Leader, University of Cambridge, U.K.; Edith Ross, Graduate Student, University of Cambridge
Input: Users input a matrix of cells and mutations along with the error rates in the sequencing data, if known.
Method: OncoNEM clusters individual cells into clones of similar genotypes. It uses what’s called a neighborhood search to try out different trees and find one that would likely yield the observed sequencing data. Like BitPhylogeny, it also infers the existence of clones that were likely present in the tumor at some point. However, the method is not too different from SCITE’s approach, says Markowetz. “SCITE and OncoNEM are really, under the hood, the same model,” he says (Genome Biol, 17:69, 2016).
Error approach: If users don’t input the error rates, OncoNEM can infer them from the data.
Output: Standard output is a clonal tree, but users can also ask the program to skip the clone-clustering step and produce a phylogenetic tree with single cells at leaves or branch points.
Pro
- It’s fast for a relatively small number of cells. The data sets Ross has tried, including dozens of cells, so far yield a tree within minutes.
Cons
- It makes the infinite-sites assumption, so it can’t take into account reversions to wild type or multiple instances of the same mutation.
- It does not distinguish between cells that are hetero- or homozygous for the same mutation.
SiFit
H. ZAFAR ET AL., bioRxiv, doi:10.1101/091595, 2016Authors: Luay Nakhleh, Chair, Department of Computer Science, Rice University, Houston, Texas; Nicholas Navin, Associate Professor, MD Anderson Cancer Center, Houston
Input: The program can analyze either the full single-cell sequences or a matrix of single-nucleotide variants, along with error rates and the probability of a mutation across the genome. In addition, users can input data that distinguishes between cells that are heterozygous or homozygous for each mutation.
Method: Hamim Zafar, a graduate student at Rice, programmed SiFit with a maximum likelihood model to search through trees for the one most likely to explain the given data (bioRxiv, doi:10.1101/091595, 2016).
Error approach: SiFit can estimate error rates from the data, if users don’t input it.
Output: SiFit builds a phylogenetic tree with each leaf representing a cell, but can also convert this into a mutation tree.
Pro
- Nakhleh and colleagues designed SiFit to avoid the infinite-sites assumption, so it allows for multiple occurrences of the same mutation, or reversion of a mutation to wild type.
Cons
- SiFit is computationally demanding, and never really “finishes” its analysis. “You can run it for days and weeks and months and you still have not explored every possible tree,” says Nakhleh. Users typically set it to run for a given time, say, six or eight hours.
- Nakhleh is thinking about how to calculate and display the probabilities that an individual branch of the output tree is correct, which the program can’t yet do, so scientists know how confident to be of each scenario.