An AI Lab Partner Helps Sift Through Transcriptomics Data

Big omics datasets can be overwhelming for researchers with limited programming skills, but texting with a new AI chatbot could help them wade through their results.

Written by Kamal Nahas, PhD
A UMAP projection of a large transcriptomics dataset.

Technological advances in sequencing have fueled the “omics revolution,” making big data a staple of biological research. However, many researchers feel ill-equipped to wrangle and analyze these massive datasets, leading them to seek the help of bioinformaticians. Now, with the help of advancing artificial intelligence (AI) technology, analysis can become less of an impediment.

In a bioRxiv preprint that has yet to undergo peer review, researchers describe an AI chatbot called CellWhisperer that analyzes transcriptomics data and reports its findings in plain English.1 Now, researchers with limited computational chops can probe their dense datasets by giving CellWhisperer non-technical queries, such as “What are these selected cells?” or “Describe the sample concisely.”

Last year, AI algorithms called large language models spooked the world with their ability to respond to prompts in articulate English, but some have looked past their startling nature to streamline data analysis. Biologists have begun training these models on literature repositories to quickly retrieve information from publications. GeneGPT, for example, can answer questions about genes by consulting genomics databases.2 Moritz Schaefer, a bioinformatician at the Medical University of Vienna and study coauthor, wanted to harness AI to simplify analysis of transcriptomics data. “Right now, biologists need to learn programming languages,” he noted. “We wanted to turn this around and said, ‘the computer should learn English.’”

When a bioinformatician analyzes transcriptomics data, they draw on past research for contextual information about patterns of gene expression. For example, they check whether a group of genes is typically expressed together by comparing against historical datasets. An AI model needs access to the same resources, so Schaefer and his colleagues trained their algorithm on pre-existing transcriptomics data. They used 20,000 studies from Gene Expression Omnibus and nearly 400,000 human transcriptomes from CELLxGENE Census.3,4 Together, these repositories equipped the AI tool with the training material it needed to recognize a cell type or a disease state based on its gene expression patterns.
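To make the idea of recognizing a cell type from its expression pattern concrete, here is a toy sketch in Python. It is not CellWhisperer's method, which the article does not describe at this level of detail. It simply assigns a sample to whichever reference profile it most resembles, using cosine similarity. The reference values, the four-gene panel, and the function names are all invented for illustration.

```python
import math

# Hypothetical reference profiles: average expression of four well-known
# marker genes per cell type. The numbers are made up for illustration.
GENES = ["INS", "GCG", "ALB", "MYH7"]
REFERENCE = {
    "pancreatic beta cell": [9.0, 0.5, 0.2, 0.1],
    "pancreatic alpha cell": [0.6, 8.5, 0.3, 0.1],
    "hepatocyte": [0.1, 0.1, 9.5, 0.2],
    "cardiomyocyte": [0.1, 0.1, 0.3, 8.8],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile resembles nothing
    return dot / (norm_a * norm_b)


def closest_cell_type(sample, reference=REFERENCE):
    """Return the best-matching reference label and all similarity scores."""
    scores = {
        label: cosine_similarity(sample, profile)
        for label, profile in reference.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# A made-up query sample with high INS expression.
label, scores = closest_cell_type([8.2, 1.0, 0.3, 0.0])
print(label)
```

Real tools work with thousands of genes and learned representations rather than a hand-picked panel, but the underlying intuition is the same: a new sample is interpreted by comparing it with patterns seen in earlier data.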

To make their tool even more user-friendly, they paired their trained model with an AI chatbot that could respond to prompts written in English. They turned to a fine-tunable open-source large language model called Mistral 7B and customized it using over 100,000 examples of conversational questions and answers about transcriptomics data.5 Simple queries included “Give a brief description of these cells,” whereas more complex prompts asked the model to list the most prominently expressed genes or the most active cellular pathways. The result was an AI chatbot adept at discussing transcriptomics, which the team made publicly available in October of this year.

To take CellWhisperer for a test run, Schaefer and his colleagues queried the model about transcriptomics studies that they had excluded from the training data. Starting with an easy task, the team confirmed that, most of the time, the model correctly identified distinctive cell types from multiple organs, including fat, muscle, lung, and skin.6 It had slightly more trouble distinguishing between similar cell types, namely ones in the pancreas.7 The model also struggled with a few transcriptomic samples from diseased cells, suggesting that the training data lacked sufficient information about these conditions. Schaefer said CellWhisperer works well with some conditions, such as certain liver cancers, but struggles more with other diseases, such as skin melanoma.

Although CellWhisperer made correct predictions most of the time, Schaefer said that users should be aware that AI tools can make occasional errors. “It’s important to keep in mind that this AI tool is especially helpful for explorative analysis and brainstorming, and all its responses need to be cross-checked with other experiments,” Schaefer noted.

Maxim Nosenko, an immunologist at Trinity College Dublin who was not involved with the work, said, “Anyone can analyze the sequencing data using CellWhisperer, so that’s certainly a big advantage.” He added, “This tool is really timely now when there is a huge amount of sequencing data.” However, he noted that, in its present form, CellWhisperer is limited to data on human cells since the researchers excluded animal findings. “It is not applicable, for now at least, to mouse studies,” said Nosenko, who uses mice as a model species.

Schaefer plans to build on CellWhisperer’s capabilities. “We want to develop this further to become a semi-autonomous research assistant,” he said. Currently, CellWhisperer responds to one query at a time, but Schaefer hopes the tool will eventually carry out a comprehensive analysis on its own without the need for small talk.

Meet the Author

  • Kamal Nahas

    Kamal is a freelance science journalist based in the UK with a PhD in virology from the University of Cambridge. He enjoys writing about the quirky side of biology, like the remarkable extent to which we depend on our gut bacteria, as well as technological breakthroughs, including how artificial intelligence can be leveraged to design proteins. His work has also appeared in Live Science, Nature, New Scientist, Science, Scientific American, and other places. Find him at www.kamalnahas.com or on X @KLNahas.
