An AI Lab Partner Helps Sift Through Transcriptomics Data

Big omics datasets can be overwhelming for researchers with limited programming skills, but texting with a new AI chatbot could help them wade through their results.

Kamal Nahas
A UMAP projection of a large transcriptomics dataset.

Cells vary greatly in their gene expression patterns, complicating interpretation of big datasets.

Moritz Schaefer


Technological advances in sequencing have fueled the “omics revolution,” making big data a staple of biological research. However, many researchers feel ill-equipped to wrangle and analyze these massive datasets, leading them to seek the help of bioinformaticians. Now, with the help of advancing artificial intelligence (AI) technology, analysis can become less of an impediment.

In a bioRxiv preprint that has yet to undergo peer review, researchers describe an AI chatbot called CellWhisperer that analyzes transcriptomics data and reports its findings in plain English.1 Now, researchers with limited computational chops can probe their dense datasets by giving CellWhisperer non-technical queries, such as “What are these selected cells?” or “Describe the sample concisely.”

Last year, AI algorithms called large language models spooked the world with their ability to respond to prompts in articulate English, but some have looked past their startling nature to streamline data analysis. Biologists have begun training these models on literature repositories to quickly retrieve information from publications. GeneGPT, for example, can answer questions about genes by consulting genomics databases.2 Moritz Schaefer, a bioinformatician at the Medical University of Vienna and study coauthor, wanted to harness AI to simplify analysis of transcriptomics data. “Right now, biologists need to learn programming languages,” he noted. “We wanted to turn this around and said, ‘the computer should learn English.’”

When a bioinformatician analyzes transcriptomics data, they draw on past research for contextual information about patterns of gene expression. For example, they check whether a group of genes is typically expressed together by comparing against historical datasets. An AI model needs access to the same resources, so Schaefer and his colleagues trained their algorithm on pre-existing transcriptomics data. They used 20,000 studies from Gene Expression Omnibus and nearly 400,000 human transcriptomes from CELLxGENE Census.3,4 Together, these repositories gave the AI tool the training material it needed to recognize a cell type or a disease state from its gene expression patterns.
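To make the co-expression check concrete, here is a minimal sketch of one common way to ask whether genes tend to be expressed together: compute the Pearson correlation between their expression levels across samples. This is an illustration only, not CellWhisperer's method. The gene names, expression values, helper functions, and the 0.9 threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels for three made-up genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "GENE_C": [5.0, 3.0, 4.0, 1.0, 2.0],
}


def coexpressed_pairs(expr, threshold=0.9):
    """Return gene pairs whose correlation meets the (assumed) threshold."""
    genes = sorted(expr)
    pairs = []
    for i, first in enumerate(genes):
        for second in genes[i + 1:]:
            r = pearson(expr[first], expr[second])
            if r >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


print(coexpressed_pairs(expression))
```

In this toy data, GENE_A and GENE_B rise and fall together, so they are the only pair flagged. Real analyses work on thousands of genes and samples and typically rely on dedicated statistical packages.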

To make their tool even more user-friendly, they paired their trained model with an AI chatbot that could respond to prompts written in English. They turned to a fine-tunable open-source large language model called Mistral 7B and customized it using over 100,000 examples of conversational questions and answers about transcriptomics data.5 Simple queries included “Give a brief description of these cells,” whereas complex prompts asked the model to list the most prominently expressed genes or the most active cellular pathways. The result was an AI chatbot adept at discussing transcriptomics, which the team made publicly available in October of this year.

To take CellWhisperer for a test run, Schaefer and his team queried the model about transcriptomics studies that they had excluded from the training data. Starting with an easy task, the team confirmed that, most of the time, the model correctly identified distinctive cell types from multiple organs, including fat, muscle, lung, and skin.6 It had slightly more trouble distinguishing between similar cell types, namely ones in the pancreas.7 The model also struggled with a few transcriptomic samples from diseased cells, suggesting that the training data lacked sufficient information about these conditions. Schaefer said CellWhisperer works well with some conditions, such as certain liver cancers, but struggles more with other diseases, such as skin melanoma.

Although CellWhisperer made correct predictions most of the time, Schaefer said that users should be aware that AI tools can make occasional errors. “It’s important to keep in mind that this AI tool is especially helpful for explorative analysis and brainstorming, and all its responses need to be cross-checked with other experiments,” Schaefer noted.

Maxim Nosenko, an immunologist at Trinity College Dublin who was not involved with the work, said, “Anyone can analyze the sequencing data using CellWhisperer, so that’s certainly a big advantage.” He added, “This tool is really timely now when there is a huge amount of sequencing data.” However, he noted that, in its present form, CellWhisperer is limited to data on human cells since the researchers excluded animal findings. “It is not applicable, for now at least, to mouse studies,” said Nosenko, who uses mice as a model species.

Schaefer plans to build on CellWhisperer’s capabilities. “We want to develop this further to become a semi-autonomous research assistant,” he said. Currently, CellWhisperer responds to one query at a time, but Schaefer hopes the tool will eventually carry out a comprehensive analysis on its own without the need for small talk.

Meet the Author

  • Kamal Nahas

    Kamal Nahas, PhD

    Kamal is a freelance science journalist based in the UK with a PhD in virology from the University of Cambridge.