AI Tool Identifies Disease-Driving Promoter Mutations

A deep-learning model trained on human data reveals that promoter mutations may explain a significant portion of unsolved rare diseases.

Written by Sahana Sitaraman, PhD | 3 min read
An abstract illustration depicting the use of artificial intelligence to study the human genome.

Each cell in the body is like a minuscule bowl of genetic soup, holding RNA from thousands of genes. But unlike an actual bowl of soup, which can forgive a little too much or not enough of certain ingredients, deviations from normal levels of gene expression can have deleterious effects on cells. These gene expression changes can lead to diseases such as cancer, diabetes, autoimmunity, and cardiovascular disorders.

Various DNA regulatory elements work together to ensure gene expression matches the cell’s needs, key among which is the gene promoter. Scientists have previously shown that promoters can drastically alter gene expression, suggesting that mutations in these non-coding regions of DNA could be the underlying causes of rare genetic disorders and cancer.1 However, identifying such pathogenic promoter variants is difficult.

Kyle Farh, a human geneticist and Vice President at Illumina's artificial intelligence lab, develops artificial intelligence tools to study the human genome.

Kyle Farh

“Evolution doesn't allow them to stick around. Since they're so rare, a clinical lab will very often never see that variant again,” said Kyle Farh, a human geneticist and vice president at Illumina's artificial intelligence lab. “Under these circumstances, it's really difficult to take a non-coding variant and claim that it could explain the patient's disease.”

The tools needed to pinpoint the problematic variants have been lacking. To fill this gap, Farh and his team have developed PromoterAI, a deep neural network that accurately identifies promoter variants that dysregulate gene expression.2 The model, described in a study published in Science, estimates that promoter mutations account for six percent of the rare genetic disease burden.

“The link to rare disease genetics is very impressive and convincing,” said Fabian Theis, a computational biologist at the Technical University of Munich who was not involved in the study. “That’s where a lot of these mutations often have strong impacts.”

Farh has been passionate about studying the non-coding genome for many years and has previously developed artificial intelligence tools to predict the consequences of splice and missense variants.3,4 Inspired by the performance of these models, he decided to build PromoterAI. Using the DNA sequences around promoters as input, Farh and his team first trained the network to predict how a variant would affect DNA accessibility, histone modifications, and enzyme interactions with DNA, all of which contribute to gene expression. To make the model more robust, the researchers then introduced data from the Genotype-Tissue Expression (GTEx) v8 cohort, a repository that includes rare non-coding promoter variants that drive gene expression to an extreme. The dataset provided paired whole-genome sequences and RNA levels from more than 800 individuals and thus helped the model learn to predict whether a promoter variant would change gene expression.
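For readers who want a concrete picture of what "predicting how a variant would affect" a promoter can mean, here is a minimal sketch of one common approach: score the reference sequence and the mutated sequence with the same model, then take the difference. This is not PromoterAI's code. The scoring function below is a hypothetical stand-in that simply counts a TATA-box-like motif, and all function names and the example sequence are invented for illustration.

```python
# Toy illustration only: this is NOT PromoterAI. The "model" here is a
# hypothetical stand-in that counts occurrences of a TATA-box-like motif.

def toy_promoter_activity(seq: str) -> float:
    """Hypothetical stand-in for a trained network's activity prediction."""
    motif = "TATAAA"
    hits = sum(
        1
        for i in range(len(seq) - len(motif) + 1)
        if seq[i:i + len(motif)] == motif
    )
    return float(hits)


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base: A, C, G, or T")
    if not 0 <= pos < len(seq):
        raise ValueError("pos is outside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(seq: str, pos: int, alt: str, model=toy_promoter_activity) -> float:
    """Predicted effect of a variant: model(alternate) minus model(reference)."""
    return model(apply_snv(seq, pos, alt)) - model(seq)


# Invented example sequence containing one TATAAA motif at positions 4-9.
reference = "GGCCTATAAAGGCGCG"
disrupting = variant_effect(reference, 6, "C")  # breaks the motif
neutral = variant_effect(reference, 0, "A")     # leaves the motif intact
print(disrupting, neutral)
```

In this toy setup, a negative score means the variant is predicted to lower activity and a score of zero means no predicted change. A real model would replace the motif counter with a trained network that reads a much longer stretch of sequence.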

Next, Farh wanted to test how well PromoterAI performed with cohorts it had never encountered. He introduced information from the UK Biobank, which consists of whole genome sequences and blood plasma protein levels from more than 50,000 participants. The model accurately picked out promoter variants from the genome that drastically altered protein expression. When tested on a cohort with rare variants, PromoterAI’s predictions held up well.
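One simple way to picture this kind of check is to ask whether people who carry a flagged variant sit at the extremes of the measured protein distribution. The sketch below does this with z-scores. It is an assumption-laden illustration rather than the study's method: the participant IDs, protein levels, and the cutoff of 2 standard deviations are all invented.

```python
from statistics import mean, stdev

# Illustrative only: invented data and an arbitrary cutoff, not the
# analysis used in the PromoterAI study.


def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sd for v in values]


def outlier_carriers(levels, carriers, threshold=2.0):
    """Return {participant: z-score} for carriers whose level is extreme.

    levels   : dict mapping participant ID to a measured protein level
    carriers : set of participant IDs that carry the flagged variant
    """
    ids = list(levels)
    zs = dict(zip(ids, z_scores([levels[i] for i in ids])))
    return {i: zs[i] for i in carriers if abs(zs[i]) >= threshold}


# Hypothetical plasma protein levels for eight participants.
levels = {"p1": 10, "p2": 11, "p3": 9, "p4": 10,
          "p5": 10, "p6": 11, "p7": 9, "p8": 30}
flagged = outlier_carriers(levels, carriers={"p2", "p8"})
print(flagged)
```

Here only the hypothetical participant p8 is reported, because p2's level is close to the group average. With tens of thousands of participants, the same idea can be turned into a statistical test of whether predicted high-impact variants are enriched among expression outliers.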

To make the tool more clinically useful, Farh and his team tested it on datasets that included paired information on gene expression levels and clinical outcomes. They observed that it could identify promoter variants that affect common biomarkers, such as liver enzymes and cholesterol levels.


One of Farh and his team’s main goals was to understand the causes of rare genetic diseases a little better. “Right now, for rare disease, we're diagnosing about one third of cases looking at protein coding sequence alone. So, there must be an explanation for the cases that we're missing,” he said. “We suspect many of them can be explained by non-coding mutations.” When the team applied PromoterAI to a cohort of patients affected by rare diseases or cancer, they observed that promoter variants accounted for six percent of the burden. When combined with splice variants, the prevalence shot up to 20 percent.

Farh is already investigating other non-coding elements, such as enhancers and microRNA target sites, and their contributions to human health. “These are also very well conserved, strongly indicating that mutations in these regions can definitely lead to penetrative rare disease or cancer,” he said.

According to Theis, this is a jumping-off point for understanding the language of DNA. “Why just look at promoter variants?” he said. “People can now easily access these models and start concatenating them to ask more complex questions.”

  1. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237-1251.
  2. Jaganathan K, et al. Predicting expression-altering promoter mutations with deep learning. Science. 2025:eads7373.
  3. Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.
  4. Gao H, et al. The landscape of tolerated genetic variation in humans and primates. Science. 2023;380(6648):eabn8197.

Meet the Author

Sahana is an Assistant Editor at The Scientist, where she crafts stories that bring the wonders and oddities of science to life. In 2022, she earned a PhD in neuroscience from the National Centre for Biological Sciences, India, studying how neurons develop their stereotypical tree-like shapes. In a parallel universe, Sahana is a passionate singer and an enthusiastic hiker.
