If there’s anything the burgeoning field of single-cell biology has revealed in the past few years, it’s that each cell is unique. Even cells of the same type can vary significantly in their complement of expressed genes. “We sort of knew this, but we now know it in spades,” says James Eberwine, codirector of the Penn Program in Single Cell Biology at the University of Pennsylvania’s Perelman School of Medicine.
That observation vindicates the monumental efforts of teams of biologists and bioinformaticians to study single cells. On the other hand, it also makes single-cell studies—especially those tackling transcriptomes—more daunting. Which differences between cells result from biological rather than technical variation? How many cells do you need to study to be able to know for certain?
Researchers are now able to answer these questions with more confidence. Single-cell RNA sequencing (scRNA-seq) is progressing on many fronts, including refining those tenuous steps of amplifying picogram amounts of RNA and generating cDNA libraries for high-throughput sequencing. New data-analysis tools, adapted for single-cell data, are coming online as well.
The Scientist asked protocol developers and experienced sequencers for advice on generating libraries and on processing and analyzing the resultant data.
Isolation is still one of the toughest steps in single-cell biology. The strategies researchers use fall into a few main categories: manual picking, fluorescence-activated cell sorting, or microfluidics. The Scientist has covered isolation techniques in earlier Lab Tools articles. (See “Pushing the Limits,” February 2015, and “Singularly Alluring,” June 2014.)
PICKING A PROTOCOL
The protocol you choose for generating a cDNA library depends on your main goals. If you want to study overall variability in transcription of cells within or across different tissues, you need a large number of cells. (Hundreds, though the number depends on many factors, including the depth at which you sequence, Eberwine says.) In contrast, if you want to look at a few genes associated with a specific process, such as cell death, you can get by with fewer cells. Whether you are studying cells from multiple animals is another consideration; you need more cells to tease out individual donor effects, according to Eberwine.
Some protocols are now allowing the generation of sequencing libraries for thousands of cells, albeit at lower read depth, though sequencing costs can quickly add up even in these situations. You will probably need to sequence in greater depth to quantify the expression of low-abundance genes or to capture the overall variability of transcriptomes, Eberwine adds.
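The trade-off between cell number and read depth comes down to simple division, sketched below in Python. The run output of 400 million reads is an assumed round figure for illustration, not a specification of any sequencer, and `reads_per_cell` is a hypothetical helper.

```python
def reads_per_cell(total_reads, n_cells):
    """Average number of reads available per cell if a run is split evenly."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return total_reads // n_cells


# Assumed run output, chosen only to make the arithmetic easy to follow.
RUN_READS = 400_000_000

# 96 and 384 match common plate sizes; 5,000 stands in for a droplet experiment.
depth = {n: reads_per_cell(RUN_READS, n) for n in (96, 384, 5000)}

for n_cells, reads in depth.items():
    print(f"{n_cells:>5} cells: about {reads:,} reads per cell")
```

Under these assumptions, moving from a 96-cell plate to 5,000 droplets cuts the average depth per cell roughly fiftyfold unless more sequencing is purchased.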
New protocols for generating sequencing libraries come out all the time, and head-to-head comparisons are hard to come by. A recent study of gene expression in mouse embryonic stem cells, led by biologist Wolfgang Enard of the Ludwig-Maximilians University Munich, compared the sensitivity, accuracy, and precision of a handful of protocols—Smart-seq, CEL-seq, SCRB-seq, and Drop-seq—and we asked others to weigh in on these as well.
Switching mechanism at 5′ end of the RNA transcript (Smart-seq) is one of the few sequencing protocols that allows you to generate full-length coverage of transcriptomes from single cells, which is important if you’re studying allele-specific gene expression or splice variants.
Fluidigm’s C1 system is encased in a benchtop machine that automatically orchestrates the steps of Smart-seq, taking your cell suspension and isolating and lysing cells, reverse transcribing their mRNA, and amplifying the resulting cDNA. It requires nonreusable microfluidic array chips that analyze 96 cells. Amplified cDNA is then sequenced or detected using qPCR.
Smart-seq on the C1 was the most sensitive in Enard’s comparison, but also the most costly. One upside to the technique is that you can put the arrays under the microscope to verify that healthy single cells occupy the wells, says Aleksandra Kolodziejczyk, a graduate student in the lab of Sarah Teichmann at the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI).
Cell expression by linear amplification and sequencing (CEL-seq) is a popular protocol that employs in vitro transcription, a type of linear amplification process, in the early steps as an alternative to PCR, which yields exponential amplification. One benefit of linear amplification is that it is less error-prone than PCR, though both amplification methods come with biases that depend on sequence.
Described in 2012, CEL-seq involves separating single cells (manually, in the original 2012 study), reverse transcribing the mRNA fragments that have poly-A tails, and giving these a barcode unique to their cell of origin (Cell Reports, 2:666-73). MARS-seq is a similar protocol (Science, 343:776-79, 2014).
Generating libraries using CEL-seq and other linear amplification protocols takes slightly longer because the in vitro transcription step is 13 hours long. On the other hand, in CEL-seq, samples are barcoded and therefore pooled early on, which cuts back on handling times. PCR is used in the final steps, but more as a means to attach the right sequencing adapters, says CEL-seq developer Itai Yanai, now at New York University.
All the reagents are readily available, and it takes about two days to generate sequencing libraries and sequence data, Yanai says. One caveat is that, like most of the other protocols covered here, it sequences only the 3′ end of transcripts. Enard, whose group did not try CEL-seq but used data from the technique in their comparison of scRNA-seq protocols, found it to be the most reproducible.
Bioinformatics tools for CEL-seq are available via GitHub. Yanai’s team is working on a new version, called CEL-seq2, which will be three times more sensitive than the original.
Developed by researchers at the Broad Institute, single-cell RNA barcoding and sequencing (SCRB-seq) uses PCR for amplification and requires access to a fluorescence-activated cell sorting (FACS) machine or another method of efficiently getting individual cells into wells.
The protocol is similar to Smart-seq, except that it incorporates cell barcodes specific to each well (which allows for early pooling of the samples) and unique molecule identifiers, or UMIs, in order to distinguish amplified molecules from the originals and thus to more accurately quantify transcripts. Unlike Smart-seq (and similar to CEL-seq), the approach enriches the 3′ ends of RNA rather than generating full-length cDNA profiles.
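To make the barcoding idea concrete, here is a minimal Python sketch of how reads from a pooled library might be assigned back to their wells. The tag layout (a 6-nt cell barcode followed by a 10-nt UMI), the barcode-to-well table, and the function name `demultiplex` are assumptions made for illustration, not the published SCRB-seq specification.

```python
from collections import defaultdict

# Hypothetical tag layout: first 6 nt = cell barcode, next 10 nt = UMI.
BARCODE_LEN = 6
UMI_LEN = 10

# Hypothetical barcode-to-well table; a real plate would list 96 or 384 entries.
WELL_BARCODES = {"ACGTAC": "A1", "TTGCAA": "A2"}


def demultiplex(tagged_reads):
    """Group (tag, gene) pairs by well, keeping the UMI with each read.

    Reads whose barcode is not in WELL_BARCODES are collected under None.
    """
    by_well = defaultdict(list)
    for tag, gene in tagged_reads:
        barcode = tag[:BARCODE_LEN]
        umi = tag[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
        well = WELL_BARCODES.get(barcode)  # None when the barcode is unknown
        by_well[well].append((umi, gene))
    return dict(by_well)


# Toy reads, already aligned to a gene; tags are built from barcode + UMI.
reads = [
    ("ACGTAC" + "A" * 10, "Actb"),
    ("ACGTAC" + "A" * 9 + "C", "Actb"),
    ("TTGCAA" + "G" * 10, "Nanog"),
    ("NNNNNN" + "C" * 10, "Actb"),  # unreadable barcode
]
wells = demultiplex(reads)
```

Because every read carries its well of origin, the samples can share a tube from the pooling step onward and still be separated again in software.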
The steps of SCRB-seq are described in a 2014 bioRxiv paper (doi.org/10.1101/003236). Following the upload of that protocol, developers updated and rebranded it as “high-throughput eukaryote 3′ digital gene expression,” which is still offered by Broad as a service and has been incorporated into WaferGen Biosystems’s scRNA-seq platform. Fluidigm’s C1 will also soon enable the SCRB-seq protocol. For DIYers, access to a FACS facility is the main barrier. SCRB-seq is Enard’s current favorite, in part for its versatility: he can use the same protocol to do bulk RNA sequencing.
Two independently developed microdroplet-based methods, Drop-seq and inDrop, are newcomers to the single-cell RNA-seq toolbox. (See “Gene Expression in a Drop,” The Scientist, August 2015.) The techniques, which isolate cells in nano- or picoliter aqueous droplets within oil, allow researchers to equip cells with barcoded primers for amplification and survey thousands of cells.
Enard finds that Drop-seq detects fewer than half as many genes per cell as Smart-seq/C1, CEL-seq, and SCRB-seq. However, in a calculation of the costs needed to detect differentially expressed genes with a specific level of statistical power, Drop-seq and SCRB-seq offered the most bang per buck, he found.
Based on user feedback, version 3.1 of the “living protocol” of Drop-seq came out in late 2015 and is available on Steve McCarroll’s lab website (mccarrolllab.com/dropseq/).
Drop-seq can be up and running within six months, says Stefan Semrau of the Leiden Institute of Physics in The Netherlands, who is a coauthor on the paper describing SCRB-seq. When he moved from the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, to Leiden, he chose to set up his new lab with Drop-seq because he doesn’t have easy access to FACS. The hardest part is creating the microfluidic chip and getting the oil-water emulsion just right. A postdoc in his lab who specializes in microfluidics got the microfluidics running in only three weeks. The library preparation is standard for any molecular biologist. Overall, plan to spend roughly six months setting up Drop-seq and optimizing it, he says.
|FOUR SINGLE-CELL RNA-SEQ TECHNIQUES|
| | Smart-seq | CEL-seq | SCRB-seq | Drop-seq |
| PolyA-mRNA capture | With PCR primer | Oligo-dT primer with Illumina 5′ adaptor, cell barcode, UMI, and T7 promoter | Oligo-dT primer contains cell barcode and UMI | With cell barcode and UMI (primers immobilized on beads and captured along with single cells in droplets) |
| Reverse transcription | Reverse transcription; template switching oligo added to 5′ end of cDNA | Reverse transcription and template switching | Reverse transcription and template switching | Reverse transcription and template switching |
| cDNA amplification | PCR, full-length | In vitro transcription from T7 promoter | Single-primer PCR | PCR |
| Fragmentation/library prep | Tagmentation using Tn5 transposase; Nextera primers added to ends | Fragmentation, followed by PCR to attach Illumina adapters | Modified fragmentation-based approach enriching for 3′ ends; modified Nextera prep | Tagmentation of cDNA with Nextera XT |
| Unique molecular identifier (UMI) | No | Yes | Yes | Yes |
| Transcript coverage | Full-length | 3′ selection | 3′ selection | 3′ selection |
| Number of cells | 96 | 96 (on Fluidigm’s C1) | 96 or 384 | 1,000s |
| Efficiency | | | Most efficient | Most efficient |
*C. Ziegenhain et al., “Comparative analysis of single-cell RNA-sequencing methods,” bioRxiv, dx.doi.org/10.1101/035758, 2016.
DEALING WITH NOISE
How do you know that the gene expression variability across cells isn’t due to technical noise? Weeding out such noise is still one of the major challenges facing even the most seasoned single-cell experts. Many factors generate noise, including incomplete lysis of cells, variability in the reagents, or inefficient reverse transcription. But in general, “I don’t think people will know at this point where the noise comes from,” Semrau says.
One way researchers deal with this problem, from the wet-lab side, is to pick protocols that use UMIs, such as CEL-seq, SCRB-seq, and Drop-seq. Counting UMIs rather than reads can cut technical noise in half (Nature Meth, 11:637-40, 2014).
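The difference between counting reads and counting UMIs can be sketched in a few lines of Python. The record format, the function name `count_reads_and_umis`, and the toy data are assumptions for illustration; real pipelines also have to handle sequencing errors within the UMIs, which this sketch ignores.

```python
from collections import Counter


def count_reads_and_umis(records):
    """Return (read_counts, umi_counts), both keyed by (cell, gene).

    records: iterable of (cell, gene, umi) tuples, one per sequenced read.
    Reads sharing cell, gene, and UMI are treated as amplification copies
    of a single original molecule.
    """
    read_counts = Counter()
    molecules = set()
    for cell, gene, umi in records:
        read_counts[(cell, gene)] += 1
        molecules.add((cell, gene, umi))
    # Each distinct (cell, gene, umi) combination counts once.
    umi_counts = Counter((cell, gene) for cell, gene, _ in molecules)
    return read_counts, umi_counts


# Toy data: both cells started with two Gapdh molecules, but amplification
# copied one molecule in cell_1 far more often than the others.
records = (
    [("cell_1", "Gapdh", "AAGT")] * 9
    + [("cell_1", "Gapdh", "CCTA")] * 1
    + [("cell_2", "Gapdh", "GGAT")] * 2
    + [("cell_2", "Gapdh", "TTCA")] * 2
)
read_counts, umi_counts = count_reads_and_umis(records)
```

In this toy case the read counts suggest that cell_1 expresses Gapdh at more than twice the level of cell_2, whereas the UMI counts recover two molecules in each cell.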
Another strategy is to use commercially available reference mRNAs, namely ERCC Spike-In Control Mix (Thermo Fisher Scientific). These are preformulated blends of RNA fragments of known abundances, developed by the National Institute of Standards and Technology’s External RNA Controls Consortium (ERCC), an ad-hoc group of academic, private, and public organizations.
The ERCC mix allows researchers to quantify technical noise. These controls are not perfect: they are not spiked directly into the cell, and protocols may differ in how efficiently they lyse the cell. Dominic Grün, a quantitative biologist at the Max Planck Institute of Immunobiology and Epigenetics, has found that validating expression levels with single-molecule fluorescence in situ hybridization (FISH) reveals some discrepancies in ERCC quantification. He does not recommend ERCCs for measuring absolute levels of gene expression, but he still thinks they are important to include for relating individual genes to transcriptomes.
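One simple way to put spike-ins to work is to compare the cell-to-cell spread of a gene with that of a spike-in of similar abundance. The Python sketch below uses the squared coefficient of variation for this; the spike-in names, gene name, and counts are invented for illustration, and the comparison is deliberately crude rather than a published noise model.

```python
from statistics import mean, variance


def squared_cv(values):
    """Squared coefficient of variation: sample variance divided by mean^2."""
    m = mean(values)
    if m == 0:
        return float("nan")
    return variance(values) / (m * m)


# Toy counts per cell. The same amount of spike-in goes into every reaction,
# so spread in these rows is taken to reflect technical noise only.
spike_in_counts = {
    "spike_high": [100, 110, 90, 100],
    "spike_low": [2, 0, 4, 2],
}
gene_counts = {
    "gene_x": [100, 10, 190, 100],  # same mean as spike_high
}

technical = {name: squared_cv(c) for name, c in spike_in_counts.items()}
observed = {name: squared_cv(c) for name, c in gene_counts.items()}

# How much more variable is gene_x than a spike-in of matching abundance?
excess = observed["gene_x"] / technical["spike_high"]
```

Note that the low-abundance spike-in is far noisier than the high-abundance one, which is why a gene should be compared with spike-ins of similar mean count rather than with a single global noise estimate.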
THE DATA DEEP DIVE
The initial steps of processing single-cell RNA-seq data look just like those used in bulk RNA sequencing. However, you get a very large batch of data for each single cell. Most people will naturally tend to zero in on their favorite genes, but that doesn’t do the data justice, Eberwine says.
If you’re a molecular biologist by training, to get the most out of your data you should get somewhat familiar with the computing language R. Most of the various ways you can digest your data use R. Formal courses may not be sufficiently tailored to your needs, so Kolodziejczyk recommends just jumping in. (A molecular biologist by training, she learned R by sitting next to generous postdocs and by Googling.)
New tools come online all the time, and in the next few years some favorites will surely emerge. There’s no one-size-fits-all solution: the current crop of analysis tools addresses a range of different questions in RNA-seq, such as studying single cells as they differentiate, or categorizing cell types by using various types of clustering analyses. Papers and conferences should be your first stop when shopping for the right ones, Semrau says.
SOME DATA-ANALYSIS CHOICES:
Single-cell studies of transcriptomes are almost always a team effort. “There are very few labs that have all the technical expertise in the lab to create new microfluidics devices, to do the genomics, to do the informatics,” Eberwine says.
Your protocol picks may well depend on what sort of equipment or expertise your neighbors have, Kolodziejczyk says. But your particular project should determine the allocation of manpower to each step. Who will generate and prepare samples, analyze data, and conduct any necessary follow-up studies pursuing function? Find a collaborator who’s going to complement your gaps in training, whether that’s in the wet or dry lab.
For a single study it’s not worth the effort to set up single-cell RNA-seq, Enard says. Some core facilities offer the entire sequencing process as a service, or you can elect to outsource one or a few of the many steps.
Several places offer once-a-year short courses on single-cell sequencing, including Cold Spring Harbor Laboratory in New York; the European Molecular Biology Laboratory in Heidelberg, Germany; the University of Pennsylvania; and the Wellcome Trust Sanger Institute near Cambridge, U.K.