For 20 years, DNA microarrays have been the go-to method for revealing patterns of gene expression. But as the price of next-generation sequencing continues to fall, RNA sequencing (RNA-seq) has become an increasingly popular method for assessing transcriptomes.
A DNA microarray consists of a predetermined assortment of nucleic acid probes attached to a surface. To assess gene expression, researchers derive complementary DNA (cDNA) from cellular RNA, label the cDNA with a fluorescent marker, wash labeled cDNA over the array, and use lasers to assess how much cDNA has stuck to each probe. RNA-seq relies on converting RNA into a cDNA library and then directly sequencing the cDNA.
While learning how to deal with the raw data RNA-seq produces can be tricky, RNA-seq has capabilities that microarrays lack. The technique can reveal previously uncharacterized transcripts, gene fusions, and genetic polymorphisms, while microarrays only pull out transcripts that researchers explicitly fish for. With sufficient sequencing depth, RNA-seq can also detect high-abundance and low-abundance transcripts more effectively than microarrays can.
Scientists are voting with their feet. The revenue that Affymetrix, a leading microarray producer, received from the gene expression portion of its business decreased from $104.5 million in 2012 to $73.4 million by 2014, according to the company’s 2014 annual report. In 2009, nearly all NIH funding for first-year grants concerning gene expression went to projects using microarrays, according to an analysis by Dave Delano, senior product manager for gene expression and regulation at Illumina. By 2013, microarrays’ share had fallen to approximately one-third of new funding.
But due to the ease of using microarrays to analyze large numbers of samples rapidly, the technology continues to dominate RNA-seq in terms of sheer numbers of samples analyzed. Weida Tong, director of bioinformatics and biostatistics at the US Food and Drug Administration (FDA) National Center for Toxicological Research in Jefferson, Arkansas, notes that, in 2014, data from more than 54,000 samples analyzed via arrays were deposited into the Gene Expression Omnibus (GEO) database, compared to data from just around 9,000 samples analyzed using RNA-seq (Genome Biol, 15:523, 2014).
Eventually, the research community will fully switch to RNA-seq, Tong says. Until then, microarray and RNA-seq data need to be more compatible and the data analysis and storage for RNA-seq must become easier. “This is just giving birth,” Tong says. “It’s painful, but once the process is finished, the community can enjoy this technology.”
Here, The Scientist discusses the transition from microarrays to RNA-seq, when researchers should make the switch, and strategies for making the process as painless as possible.
A WHOLE NEW WORLD
For applications such as exploratory work or research using nonmodel organisms, RNA-seq is a clear winner because it reveals transcriptomes without bias, uncovering novel splice junctions, small RNAs, and even novel genes that microarrays simply miss. (See “Transcriptomics for the Animal Kingdom,” The Scientist, July 2013.)
“Unlike microarray probes, RNA sequencing does not require a priori sequence knowledge of the sample for analysis,” Kevin Poon, global product manager of gene regulation at Agilent Technologies in Santa Clara, California, writes in an e-mail to The Scientist. “In this way, it is an ideal platform for discovery research; obtaining the absolute sequence of transcripts enables the discovery of mutations and fusion transcripts.” Agilent produces both microarrays and RNA-seq tools.
Mariano Alvarez, a graduate student in the lab of Christina Richards at the University of South Florida (USF) in Tampa, studies how the 2010 Gulf oil spill has affected gene expression in the hexaploid salt marsh grass Spartina alterniflora. Alvarez and his collaborators started out using microarrays to assess gene expression in oil-exposed versus nonexposed plants. But for a new project surveying gene expression in invasive populations of Japanese knotweed (Fallopia japonica), the researchers are including RNA-seq data, in hopes of better understanding how expression of gene variants and isoforms differs across habitats.
RNA-seq has also been valuable for exploring the uncharted regions of even well-studied species’ transcriptomes. For instance, in December the University of Toronto’s Benjamin Blencowe and his colleagues used a novel RNA-seq computational method to demonstrate altered transcription patterns of tiny snippets of DNA called microexons in different brain tissues and in people with autism versus controls (Cell, 159:1511-23, 2014).
Researchers who switch to RNA-seq are often “seeing dimensions of the biology they just weren’t picking up in microarrays,” says Anup Parikh, a senior product manager at Thermo Fisher Scientific, which sells RNA-seq tools through its Ion Torrent brand.
RNA-seq is also the right choice for researchers hoping to detect transcripts expressed at very low abundance, according to some scientists. Last year, the FDA’s Tong and his colleagues used both Illumina’s RNA-seq platform and Affymetrix microarrays to assess changes in gene expression in rat liver samples following chemical treatments (Nature Biotechnol, 32:926-32, 2014). The researchers found that, for the more abundant half of the differentially expressed genes, the two platforms were in near-complete agreement. For the less-expressed genes, RNA-seq was more accurate. Other studies support this conclusion (BMC Bioinformatics, 14:9, 2013; PLOS ONE, 9:e78644, 2014).
The main reason for the difference is that when transcript levels are low, the fluorescence emitted by cDNAs bound to a microarray probe can be so faint that it is swamped by background fluorescence. RNA-seq, meanwhile, can detect ever-lower transcript levels as coverage increases, with no hard lower limit. The same is true at the top end of gene expression. For highly expressed genes, microarrays can become saturated.
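The link between sequencing depth and detection can be illustrated with a back-of-the-envelope model. The Python sketch below assumes, as a simplification not drawn from the studies cited above, that reads are sampled independently, so the read count for a given transcript follows a Poisson distribution. The function names and the example abundance are hypothetical.

```python
import math


def expected_reads(total_reads: int, transcript_fraction: float) -> float:
    """Expected reads from a transcript that makes up `transcript_fraction`
    of the sequenced library (simplifying assumption: uniform sampling)."""
    return total_reads * transcript_fraction


def detection_probability(total_reads: int, transcript_fraction: float,
                          min_reads: int = 1) -> float:
    """Probability of observing at least `min_reads` reads for the transcript,
    assuming Poisson-distributed counts."""
    lam = expected_reads(total_reads, transcript_fraction)
    # P(X >= k) = 1 - sum over i < k of exp(-lam) * lam^i / i!
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(min_reads))
    return 1.0 - below


if __name__ == "__main__":
    rare = 1e-7  # hypothetical abundance: one read in ten million
    for depth in (1_000_000, 10_000_000, 100_000_000):
        p = detection_probability(depth, rare, min_reads=5)
        print(f"{depth:>11,} reads: P(at least 5 reads) = {p:.3f}")
```

Under this toy model the chance of detecting a rare transcript climbs steadily with depth, whereas a fluorescence signal lost in background noise cannot be recovered by running the same array for longer. Real libraries depart from the Poisson assumption, so the numbers are illustrative only.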
Despite RNA-seq’s strong points, many researchers continue to use microarrays, particularly for research involving large numbers of samples. And microarrays shine in clinical studies, as data can be turned around quickly and easily. “Microarrays provide highly consistent data and use well-established analytical pipelines,” says Poon. “From analyzing hundreds to thousands of samples, gene and miRNA expression signatures have been developed with clinical diagnostic value.”
“I’ll always do microarray,” says Kirk Mantione, head of molecular biology for the fledgling mitochondrial therapeutics company MitoGenetics based in Farmingdale, New York. “I know how to do it already, and the results are more easily interpreted.”
Mantione uses microarrays to assess the effects of drugs he is developing on gene expression in cell lines and in animals. Microarrays can quickly and easily tell him how compounds affect specific genes. However, Mantione also hopes to begin using RNA-seq to study underexplored organisms or to look for previously undetected polymorphisms in the transcripts he studies.
Affymetrix suggests that some researchers may want to use microarrays to quickly screen large numbers of samples and then use the results to guide their RNA-seq projects. Or, microarrays could be used to validate RNA-seq data. And sometimes people simply continue to use microarrays because they want to compare new data with previously gathered data, which is easier if all of the data are produced the same way.
Despite the decreasing costs of sequencing, the expense of microarrays versus RNA-seq remains a factor for some projects. Microarray analyses cost as little as $100 per sample for standard gene-expression analysis and $300 per sample for more-complex analyses involving differentiation between variant splice forms, according to Affymetrix. (This estimate excludes extra fees from service providers.) The cost of RNA-seq is more variable, as it depends on number of reads, read length, sequencing technique, number of samples that fit into a single run, and familiarity with the species being sequenced, among other factors.
“It’s definitely possible to get the lowest cost sequencing to be at least equivalent to the cost of a microarray,” says Shawn Baker, chief science officer and co-founder of AllSeq, an online sequencing marketplace where individuals, core facilities, and companies can sell their services to users. “It’s just that most people don’t do that. Instead of really trying to drive the cost down as much as possible, they try to get more out of the data.” They want to look at low-abundance genes, or novel gene fusions or splice junctions, for instance, which drives the cost up.
Baker recently analyzed prices of RNA-seq projects commissioned via the AllSeq marketplace. He found that if he normalized prices to projects with 10 million reads, or DNA fragments sequenced per sample, the average RNA-seq project cost was quoted at approximately $565 per sample. However, many researchers do more than 10 million reads, which could drive up the cost considerably. Affymetrix estimates that RNA-seq costs anywhere from a few hundred to a few thousand dollars per sample.
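To see how these per-sample figures add up over a project, the rough Python sketch below uses only the estimates quoted above: $100 and $300 per sample for microarrays, and about $565 per sample for RNA-seq at 10 million reads. The function names and sample counts are hypothetical, and the calculation ignores service fees, deeper sequencing, and analysis time.

```python
# Per-sample estimates quoted in the article (US dollars).
MICROARRAY_STANDARD = 100.0   # Affymetrix: standard gene-expression analysis
MICROARRAY_COMPLEX = 300.0    # Affymetrix: splice-variant analysis
RNASEQ_10M_READS = 565.0      # AllSeq: average quote normalized to 10 million reads


def project_cost(n_samples: int, cost_per_sample: float) -> float:
    """Total cost, assuming the per-sample price does not change with volume."""
    return n_samples * cost_per_sample


def cost_gap(n_samples: int,
             rnaseq_cost: float = RNASEQ_10M_READS,
             array_cost: float = MICROARRAY_STANDARD) -> float:
    """Extra spend for RNA-seq over microarrays for the same number of samples."""
    return project_cost(n_samples, rnaseq_cost) - project_cost(n_samples, array_cost)


if __name__ == "__main__":
    for n in (12, 100, 1000):  # hypothetical project sizes
        print(f"{n:>5} samples: RNA-seq costs ${cost_gap(n):,.0f} more "
              f"than standard arrays")
```

The gap is modest for a dozen samples and grows linearly with project size, which is why price matters most for studies with very large sample counts.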
Researchers will also need to take into account how time-consuming data analysis and storage will be. According to Affymetrix, microarray data can be translated into a usable format nearly immediately through free software provided by the company, while RNA-seq analysis can take weeks or even months. “This massively reduces the overall costs for each microarray project relative to RNA-seq,” Affymetrix says.
But Tong says that, in many types of studies, such as toxicological evaluations, the difference between the price of RNA-seq and microarray analysis has become negligible. Researchers analyzing huge numbers of in vitro samples are more likely to find that price differences add up.
Perhaps the largest obstacle for researchers trying out RNA-seq is the tsunami of data that the process produces. In one study comparing microarrays and RNA-seq, each Affymetrix array produced 5 megabytes of raw data. RNA-seq on the same immune-cell sample, run on the Illumina system with 100 million reads (admittedly extraordinarily high coverage), generated a staggering 23 gigabytes of raw data per sample (PLOS ONE, 9:e78644, 2014).
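The scale of that difference is easier to grasp with a quick calculation. The Python sketch below takes the two per-sample figures from the study cited above and assumes decimal units (1 gigabyte = 1,000 megabytes). The function names and the 100-sample project are hypothetical.

```python
# Raw data per sample reported in the cited study.
ARRAY_MB_PER_SAMPLE = 5.0     # Affymetrix array
RNASEQ_GB_PER_SAMPLE = 23.0   # Illumina RNA-seq at 100 million reads
MB_PER_GB = 1000.0            # assumption: decimal units


def size_ratio() -> float:
    """How many times larger one RNA-seq sample is than one array sample."""
    return RNASEQ_GB_PER_SAMPLE * MB_PER_GB / ARRAY_MB_PER_SAMPLE


def raw_storage_gb(n_samples: int, platform: str) -> float:
    """Raw storage in gigabytes for `n_samples` on 'microarray' or 'rnaseq'."""
    if platform == "microarray":
        return n_samples * ARRAY_MB_PER_SAMPLE / MB_PER_GB
    if platform == "rnaseq":
        return n_samples * RNASEQ_GB_PER_SAMPLE
    raise ValueError(f"unknown platform: {platform!r}")


if __name__ == "__main__":
    n = 100  # hypothetical project size
    print(f"RNA-seq sample is about {size_ratio():,.0f}x the size of an array sample")
    print(f"{n} arrays:          {raw_storage_gb(n, 'microarray'):,.1f} GB")
    print(f"{n} RNA-seq samples: {raw_storage_gb(n, 'rnaseq'):,.1f} GB")
```

These figures cover raw data only; intermediate files produced during alignment and quantification would add to the RNA-seq total, and lower read depths would shrink it.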
While methods for analyzing microarray data are fully mature and straightforward, there is no consensus on which pipelines—or series of computational steps—to use to analyze RNA-seq data. Riki Kawaguchi, a senior bioinformatics scientist at the University of California, Los Angeles, says he often compares the performance of several pipelines on each data set he analyzes for collaborators. “There’s no one best package you can use for every data set,” he says.
“Once you have basic programming skills, it’s pretty straightforward to use existing pipelines to clean up your data and to get it into the format you need . . . and then either to upload into a cloud service or analyze it using the software you have in the lab,” says USF’s Alvarez. “I think it is just a time commitment; if you’re used to using microarrays, you just sort of have to bite the bullet and dive in.”
Illumina’s Delano adds that last year the company introduced applications in BaseSpace, a computing environment for sequencing data analysis and management, that should also make RNA-seq easier for the average user. These applications allow researchers to select and apply various popular pipelines to their data.
One way to ease into the RNA-seq era is to outsource the process. Researchers can pay a core facility or sequencing service provider to carry out library preparation and sequencing. Baker of AllSeq says that researchers should probably only purchase and use their own sequencing machines if they know they will frequently need maximally fast turnaround, if they have enough samples to keep their machine running constantly, or if they’re doing something very unusual and experimental with their sequencing technique. Science Exchange and Genohub are other companies that will connect researchers with service providers.
Alternatively, Thermo Fisher recently came out with a service that attempts to make RNA-seq a little more similar to microarrays. For researchers just hoping to sequence known transcripts, the company last August launched AmpliSeq Transcriptome, a library preparation kit and sequencing workflow that reverse transcribes and amplifies 20,000 human RNA transcripts at once using PCR. Researchers then sequence the resulting library using the Ion Proton sequencer and use Thermo Fisher software to convert data to a format that should be recognizable and manageable to researchers familiar with microarrays.
“If what you really care about is gene expression of the known transcripts within your experiment, AmpliSeq Transcriptome will give you exactly that at a much lower cost and complexity than the whole RNA-seq method,” Parikh says. The cost of the Ion AmpliSeq Transcriptome Human Gene Expression Kits ranges from $65 to $104 per assay, depending on whether researchers buy the 24-, 96-, or 384-assay kits.
While researchers using AmpliSeq will not gain information about novel transcripts, they will profit from some of RNA-seq’s benefits, such as good ability to detect transcripts expressed at especially high or low levels. AmpliSeq is also well-suited for preparing samples with low RNA abundance and quality, such as preserved patient tissues.
Beyond the AmpliSeq Transcriptome kit, Thermo Fisher also offers customized AmpliSeq panels allowing researchers to assess chosen transcripts. Illumina, meanwhile, offers TruSeq Targeted RNA Expression Kits, which similarly allow researchers to assess expression of a subset of genes, isoforms, gene variants, or other features whose expression most interests them. And Agilent SureSelect allows researchers to select targeted regions of the transcriptome for sequencing. Using SureSelect, researchers can detect novel transcripts and polymorphisms in their targeted genomic region.
The benefit of sequencing only a limited portion of the transcriptome is that the resulting data output is less overwhelming than full RNA-seq results. Thermo Fisher’s Scott Dewell, a product manager at the company, says for the “everyperson,” it is much easier to simply get the spreadsheet with expression values that AmpliSeq produces. “You could e-mail the results to your collaborator,” Dewell says.