RNA sequencing is a popular tool among molecular biologists because it allows them to examine gene expression patterns by measuring RNA transcripts. However, the technique is susceptible to experimental artifacts, which can lead to misinterpreted findings. According to a study published last week (November 12) in PLOS Biology, one such bias, which is associated with gene length, is widespread in many published datasets.
Rani Elkon, a bioinformatician at Tel Aviv University in Israel, says that his team was analyzing RNA sequencing (RNA-seq) datasets for a project aimed at inferring the co-regulation of genes by examining their co-expression across many different biological conditions when they stumbled upon a puzzling finding: Genes coding for proteins in the ribosome or other translation-related machinery (which are exceptionally short) and genes coding for extracellular matrix proteins such as collagen (which are exceptionally long) kept popping up in their analyses. “In many different datasets, genes that were upregulated and downregulated were enriched for those specific functions,” Elkon says.
The team wondered whether there was a biological explanation, or if this was the result of a technical glitch. To address that question, they selected 35 human and mouse RNA-seq datasets from GEO, a publicly available genomics data repository. Most of the datasets they chose appeared in papers published between 2017 and 2018 and contained between two and four replicate samples assessing the same biological condition, for example, treatment with tumor necrosis factor, a protein involved in inflammation.
Their analysis revealed that extremely short or long genes showed different expression patterns between replicate samples, indicating that the differences were an experimental artifact. If the transcripts were reflecting some cellular activity relevant to the biological condition in question, their abundance should have been consistent across each condition’s samples. This issue, which the authors refer to as sample-specific length bias, was present in 30 of the 35 datasets. “This indicated to us that the . . . enrichment for the very long and very short genes actually reflects some kind of technical problem in the experiment,” Elkon tells The Scientist.
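The reasoning behind that check can be illustrated with a short Python sketch. It is not the authors' pipeline, only one simple way to ask the same question: if two replicates of the same condition are truly equivalent, the log ratio of a gene's expression between them should not trend with gene length. The function name `replicate_length_bias`, the pseudocount, and the toy numbers are all assumptions made for illustration, and the counts are assumed to be already scaled for sequencing depth.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def replicate_length_bias(lengths, rep_a, rep_b, pseudocount=1.0):
    """Spearman correlation between gene length and the log2 ratio
    of expression in two replicates of the same condition.

    Values near 0 suggest no length trend; values near +1 or -1
    suggest that one replicate favors long (or short) genes.
    """
    log_ratio = [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(rep_a, rep_b)
    ]
    return pearson(average_ranks(lengths), average_ranks(log_ratio))


# Toy data (hypothetical): six genes, shortest to longest.
gene_lengths = [500, 1000, 2000, 4000, 8000, 16000]
replicate_1 = [100, 100, 100, 100, 100, 100]
replicate_2 = [50, 70, 100, 140, 200, 280]  # rises with gene length

biased = replicate_length_bias(gene_lengths, replicate_1, replicate_2)
unbiased = replicate_length_bias(gene_lengths, replicate_1, replicate_1)
print(biased, unbiased)
```

In the toy data the second replicate reports progressively more signal for longer genes, so the correlation is strongly positive, while comparing a replicate with itself gives no trend at all. Real datasets would need thousands of genes and a threshold chosen with care.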
The researchers also found that the sample-specific length bias increased the number of false positives in gene-set enrichment analysis (GSEA), a method that is widely used to examine whether genes that show altered levels of expression between RNA-seq datasets correspond to a biological function.
Kaspar Hansen, a biostatistician at Johns Hopkins University who did not take part in the study, says that this sample-specific length bias is actually well-described in the literature. This study shows that despite awareness of the problem—at least among methods-oriented scientists—many researchers are not routinely using existing tools to address it, he adds. “I was surprised by the high percentage of datasets where this bias is an issue.”
Elkon and his team tested whether existing quality-control tools could correct this problem. They found that cqn (conditional quantile normalization) and EDASeq (exploratory data analysis and normalization for RNA-seq) could effectively eliminate the sample-specific length biases in the datasets they examined.
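To give a rough sense of what a length-aware correction does, here is a deliberately simplified Python sketch. It is not cqn or EDASeq, which are R packages with their own, more sophisticated models; it swaps in a much cruder idea, grouping genes into length bins and subtracting each bin's median log ratio, so that a trend shared by all genes of similar length is removed while a gene that stands out from its neighbors keeps its signal. The function name `center_by_length_bin`, the bin count, and the toy values are assumptions for illustration only.

```python
from statistics import median


def center_by_length_bin(lengths, log_ratios, n_bins=4):
    """Subtract the median log ratio of each gene-length bin.

    Genes are sorted by length and split into n_bins groups of
    roughly equal size; each group is then centered on zero.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    n = len(order)
    corrected = [0.0] * n
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            continue
        bin_median = median(log_ratios[i] for i in members)
        for i in members:
            corrected[i] = log_ratios[i] - bin_median
    return corrected


# Toy data (hypothetical): short genes are shifted down by 1,
# long genes are shifted up by 1, and one short gene (index 3)
# carries a genuine change on top of the shift.
gene_lengths = [500, 700, 900, 1100, 9000, 12000, 15000, 18000]
log_ratios = [-1.0, -1.0, -1.0, 2.0, 1.0, 1.0, 1.0, 1.0]

corrected = center_by_length_bin(gene_lengths, log_ratios, n_bins=2)
print(corrected)  # [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]
```

Using the median rather than the mean keeps a single strongly changed gene from dragging its whole bin, which is why the outlier at index 3 survives the correction in this example.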
“I think [this paper is] a really nice demonstration of how important it is to do quality control checking,” says Michael Love, a biostatistician at the University of North Carolina-Chapel Hill who wasn’t involved in the study. Love adds that there are other biases affecting RNA-seq data, such as GC-content bias, in which the amount of guanine (G) and cytosine (C) can influence whether the expression level of a gene is over- or under-represented in some samples.
There are other length-related biases as well. In 2009, Alicia Oshlack, a bioinformatician who was then at the Walter and Eliza Hall Institute of Medical Research in Australia, and her colleague reported a technical bias inherent to RNA-seq protocols that makes it easier to identify differences in expression in longer genes than in shorter ones. She and her team also developed a method, GOSeq, to address this overrepresentation. Oshlack, who is now at Murdoch Children’s Research Institute, tells The Scientist in an email that while the length biases reported by her group and Elkon’s group are slightly different, they would likely affect GSEA in the same way.
The sample-specific length bias is most likely the result of a technical issue in RNA-seq pipelines, although the exact cause remains unclear, Elkon says. He hopes that shining a spotlight on the problem will make other researchers aware of it and prompt them to take steps to address it.
“I’d say it’s a potentially impactful paper because it is important to know about these things and to think about them when you do your analysis,” says Hansen, who developed cqn, one of the methods tested in Elkon’s study. “Sometimes the community needs good reminders that this is actually an issue.”
S. Mandelboum et al., “Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias,” PLOS Biology, doi:10.1371/journal.pbio.3000481, 2019.
Diana Kwon is a Berlin-based freelance journalist. Follow her on Twitter @DianaMKwon.