ABOVE: CLEANING UP RNA SEQ DATA: Computational methods help researchers reduce noise in the data generated during single-cell RNA sequencing.


The papers

G. Eraslan et al., “Single-cell RNA-seq denoising using a deep count autoencoder,” Nat Commun, 10:390, 2019.

M. Büttner et al., “A test metric for assessing single-cell RNA-seq batch correction,” Nat Methods, 16:43–49, 2019.

In the not-so-distant past, researchers had to pool thousands of cells together for bulk RNA sequencing, yielding an averaged snapshot of gene expression. But advances in technology and significant reductions in cost now enable scientists to sequence RNA from single cells, unleashing a flood of transcription data.

“It used to be that you had to wait for the biologists [to generate data for analysis], but now we are the slow guys,” says Fabian Theis, a computational biologist at Helmholtz Zentrum München in Germany. “There’s just...

One difficulty with single-cell RNA sequencing data is separating meaningful variation from noise. For instance, a gene may appear “silent” because it is not expressed or because its expression was missed for technical reasons. Theis and colleagues try to cut that noise with an artificial intelligence algorithm called a deep count autoencoder (DCA), an artificial neural network that can compress gene expression data into fewer dimensions, distilling the information down to its most important relationships.

To see how well this compressed representation captures the full picture, the algorithm reconstructs a full-size dataset and compares it to the original, noting the differences. It repeats this process, stopping once 25 consecutive cycles pass without improvement. In two trials with simulated data, the scientists removed some sequences from the database to introduce noise and found that the autoencoder could recover cell-type information that had become masked. In their Nature Communications paper, published in early 2019, they also applied the algorithm to several examples of real transcriptome data, using it in one case to identify the cell types in a blood sample, a task required for various medical and research applications.
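The compress, reconstruct, compare, and stop-after-25-stale-cycles loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the DCA implementation: the two-gene data, the single-number bottleneck, the squared-error comparison, the learning rate, and the `min_delta` threshold are all assumptions made for the example. Only the patience of 25 cycles comes from the article.

```python
import random

def reconstruct(x, enc, dec):
    """Compress a 2-value profile to one number, then expand it back."""
    z = enc[0] * x[0] + enc[1] * x[1]      # bottleneck: a single dimension
    return z, [dec[0] * z, dec[1] * z]

def mean_squared_error(data, enc, dec):
    """Average squared difference between originals and reconstructions."""
    total = 0.0
    for x in data:
        _, r = reconstruct(x, enc, dec)
        total += (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2
    return total / len(data)

def train(data, patience=25, lr=0.01, min_delta=1e-6, max_epochs=2000, seed=0):
    """Gradient descent with early stopping; lr, min_delta, max_epochs are assumed."""
    rng = random.Random(seed)
    enc = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    dec = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    initial_loss = mean_squared_error(data, enc, dec)
    best_loss, best_weights = float("inf"), (enc[:], dec[:])
    stale, epochs_run = 0, 0
    for _ in range(max_epochs):
        epochs_run += 1
        for x in data:
            z, r = reconstruct(x, enc, dec)
            err = [r[0] - x[0], r[1] - x[1]]
            # Gradients of the squared error for this one sample.
            grad_dec = [2 * err[i] * z for i in range(2)]
            dz = 2 * (err[0] * dec[0] + err[1] * dec[1])
            grad_enc = [dz * x[i] for i in range(2)]
            for i in range(2):
                dec[i] -= lr * grad_dec[i]
                enc[i] -= lr * grad_enc[i]
        loss = mean_squared_error(data, enc, dec)
        if loss < best_loss - min_delta:
            # Meaningful improvement: remember these weights, reset the counter.
            best_loss, best_weights, stale = loss, (enc[:], dec[:]), 0
        else:
            stale += 1
            if stale >= patience:          # 25 cycles in a row without improvement
                break
    return best_weights, initial_loss, best_loss, epochs_run

# Toy data: two "genes" whose expression is strongly correlated, plus noise.
data_rng = random.Random(1)
data = []
for _ in range(200):
    t = data_rng.uniform(-1.0, 1.0)
    data.append([t, 2.0 * t + data_rng.gauss(0.0, 0.05)])

weights, initial_loss, best_loss, epochs_run = train(data)
print("loss before training:", initial_loss)
print("best loss:", best_loss, "after", epochs_run, "cycles")
```

Because the two toy genes move together, one number per cell is enough to rebuild both values closely, which is the sense in which the bottleneck keeps the most important relationship and discards the rest.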

Another group published an AI-based tool similar to DCA a few months earlier, which could also recover data hidden by noise and cluster cells into subgroups based on their mRNA (Nat Methods, 15:1053–58, 2018). Compared with more-traditional statistical approaches, the autoencoder techniques are “a more universal approach, quite elegant, where you . . . can let the machine learning take care of fitting all the parameters,” says Peter Kharchenko, a computational biologist at Harvard University who was not involved in this research. Plus, he adds, they’re very flexible and scalable. “The huge advantage of these [models] is that it’s easier to build on these tools.”

In addition to extracting more information from individual datasets, researchers can combine data from different days and augment their pool of sequences with ones from other labs. Considering multiple datasets together allows researchers to view a more complete landscape of cellular biology, says Zhichao Miao, a computational postdoc at the European Molecular Biology Laboratory–European Bioinformatics Institute. However, sequence reads in different datasets are influenced not only by biological variation in the cells being sequenced (differences over time or between treatment groups) but also by unintended variation in experimental conditions and in the methodology, such as the sequencing protocol used.

Scientists use statistical approaches to remove such technical noise, the so-called batch effects, while trying to capture and dissect biological variation. But there wasn’t a quantitative metric for measuring how “batchy” the data remain, says Miao. So he, Theis, and their collaborators developed a method called the k-nearest-neighbor batch-effect test, or kBET, that determines variance in the datasets and scores the different approaches on how well they eliminate batch effects to leave data that are “well mixed.”
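One way to picture a k-nearest-neighbor test for "well mixed" data is to ask, for each cell, whether the batch labels among its nearest neighbors match the batch proportions of the whole dataset. The sketch below is a simplified illustration of that idea and not the published kBET code: the function name `rejection_rate`, the choice of k = 20, the two-batch chi-squared comparison at the 5 percent level, and the simulated two-dimensional cells are all assumptions.

```python
import random

# 95th percentile of the chi-squared distribution with 1 degree of freedom.
CHI2_CRIT_DF1 = 3.841

def rejection_rate(points, batches, k=20):
    """Fraction of cells whose k-neighborhood has a batch mix unlike the global mix.

    Assumes exactly two batches, labeled 0 and 1, with both present.
    """
    n = len(points)
    frac_b1 = sum(batches) / n                       # global share of batch 1
    expected = [k * (1 - frac_b1), k * frac_b1]
    rejected = 0
    for i in range(n):
        # Brute-force nearest neighbors by squared Euclidean distance.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        in_b1 = sum(batches[j] for j in neighbors)
        observed = [k - in_b1, in_b1]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        if chi2 > CHI2_CRIT_DF1:                     # local mix differs from global
            rejected += 1
    return rejected / n

# Simulated cells: 100 per batch, described by two made-up coordinates.
rng = random.Random(0)
batches = [0] * 100 + [1] * 100
mixed = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in batches]
# Same cells, but batch 1 is displaced to mimic a strong batch effect.
shifted = [[x + 5.0 * b, y] for (x, y), b in zip(mixed, batches)]

mixed_rate = rejection_rate(mixed, batches)
shifted_rate = rejection_rate(shifted, batches)
print("rejection rate, well-mixed data:", mixed_rate)
print("rejection rate, batch-shifted data:", shifted_rate)
```

A low rejection rate suggests that neighborhoods look like the dataset as a whole, while a rate near 1 suggests that cells still cluster by batch. Running the same score before and after a correction method gives a rough, quantitative sense of how much "batchiness" the method removed.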

The kBET work, published online in Nature Methods last December, is a “good step forward,” says Kharchenko, but new approaches may be required to evaluate batchiness in highly variable datasets, such as RNA sequences from cancer patients and healthy people. “If you consider the difference between samples to be technical variability, then it’s clear what you want to do with it. You want to remove it,” says Kharchenko, whose lab is developing tools for analyzing datasets with more-systematic differences. “If we move to that more challenging setting [of comparing very different groups], then the question itself becomes a little bit less obvious.”

May 2019 The Scientist Issue