AI Tool Identifies Disease-Driving Promoter Mutations

A deep-learning model trained on human data reveals that promoter mutations may explain a significant portion of unsolved rare diseases.

Written by Sahana Sitaraman, PhD
An abstract illustration depicting the use of artificial intelligence to study the human genome.

Each cell in the body is like a minuscule bowl of genetic soup, holding RNA from thousands of genes. But unlike an actual bowl of soup, which can forgive a little too much or not enough of certain ingredients, deviations from normal levels of gene expression can have deleterious effects on cells. These gene expression changes can lead to diseases such as cancer, diabetes, autoimmunity, and cardiovascular disorders.

Various DNA regulatory elements work together to ensure gene expression matches the cell’s needs, key among which is the gene promoter. Scientists have previously shown that promoters can drastically alter gene expression, suggesting that mutations in these non-coding regions of DNA could be the underlying causes of rare genetic disorders and cancer.1 However, identifying such pathogenic promoter variants is difficult.

Kyle Farh, a human geneticist and Vice President at Illumina's artificial intelligence lab, develops artificial intelligence tools to study the human genome.

Kyle Farh

“Evolution doesn't allow them to stick around. Since they're so rare, a clinical lab will very often never see that variant again,” said Kyle Farh, a human geneticist and vice president at Illumina's artificial intelligence lab. “Under these circumstances, it's really difficult to take a non-coding variant and claim that it could explain the patient's disease.”

The tools needed to pinpoint the problematic variants have been lacking. To fill this gap, Farh and his team have developed a deep neural network, PromoterAI, that accurately identifies promoter variants that dysregulate gene expression.2 The study, published in Science, estimates that promoter mutations account for six percent of the rare genetic disease burden.

“The link to rare disease genetics is very impressive and convincing,” said Fabian Theis, a computational biologist at the Technical University of Munich who was not involved in the study. “That’s where a lot of these mutations often have strong impacts.”

Farh has been passionate about studying the non-coding genome for many years and has developed artificial intelligence tools to understand the consequences of splice and missense variants.3,4 Inspired by the performance of these models, he decided to build PromoterAI. Feeding the DNA sequences around promoters into the model, Farh and his team first trained the network to predict how a variant would affect DNA accessibility, histone modifications, and enzyme interactions with DNA, all of which contribute to gene expression. To make the model more robust, the researchers introduced data from the Genotype-Tissue Expression (GTEx) v8 cohort, a repository that contains data on rare non-coding promoter variants that drive gene expression to an extreme. The database provided paired whole genome sequences and RNA levels from over 800 individuals and thus helped the model learn to predict whether a promoter variant would change gene expression.
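The idea of pairing rare promoter variants with extreme expression can be illustrated with a small sketch. This is not PromoterAI or the authors' pipeline. It assumes a hypothetical table of per-individual expression values for a single gene and a hypothetical set of individuals who carry a rare promoter variant, and it labels carriers whose expression is an outlier by z-score. The threshold of 2.0 and every name in the code are illustrative assumptions.

```python
from statistics import mean, pstdev


def expression_z_scores(expression):
    """Return a z-score per individual for one gene's expression values."""
    values = list(expression.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No variation across individuals: nobody can be an outlier.
        return {person: 0.0 for person in expression}
    return {person: (value - mu) / sigma for person, value in expression.items()}


def label_outlier_carriers(expression, carriers, threshold=2.0):
    """Label each variant carrier as 'over', 'under', or 'normal'.

    The threshold is an illustrative assumption, not a value from the study.
    """
    z = expression_z_scores(expression)
    labels = {}
    for person in carriers:
        if z[person] >= threshold:
            labels[person] = "over"
        elif z[person] <= -threshold:
            labels[person] = "under"
        else:
            labels[person] = "normal"
    return labels


# Toy data (hypothetical): expression of one gene in ten individuals.
expression = {
    "p1": 10.0, "p2": 11.0, "p3": 9.0, "p4": 10.5, "p5": 9.5,
    "p6": 10.0, "p7": 10.2, "p8": 9.8, "p9": 30.0, "p10": 10.0,
}
# Hypothetical carriers of a rare promoter variant in this gene.
carriers = {"p3", "p9"}

labels = label_outlier_carriers(expression, carriers)
```

In this toy cohort, the carrier with strongly elevated expression is labeled as an over-expression outlier, while the other carrier falls within the normal range. Labels of this kind are the sort of signal a model could learn from when sequence and expression data are available for the same individuals.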

Next, Farh wanted to test how well PromoterAI performed with cohorts it had never encountered. He introduced information from the UK Biobank, which consists of whole genome sequences and blood plasma protein levels from more than 50,000 participants. The model accurately picked out promoter variants from the genome that drastically altered protein expression. When tested on a cohort with rare variants, PromoterAI’s predictions held up well.
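One simple way to picture this kind of held-out check is to ask how often a predicted direction of effect agrees with a measured one. The sketch below is an assumption-laden illustration and not the evaluation used in the study: it supposes each variant has a hypothetical model score (positive for predicted over-expression, negative for under-expression) and a measured protein-level z-score in carriers, and it reports the fraction of confidently scored variants where the two signs agree. The score cutoff of 0.5 is arbitrary.

```python
def sign_concordance(records, score_cutoff=0.5):
    """Fraction of confidently scored variants whose predicted direction
    matches the measured direction.

    records: iterable of (model_score, measured_z) pairs (hypothetical format).
    Returns None when no variant passes the cutoff.
    """
    confident = [
        (score, z) for score, z in records
        if abs(score) >= score_cutoff and z != 0
    ]
    if not confident:
        return None
    matches = sum(1 for score, z in confident if (score > 0) == (z > 0))
    return matches / len(confident)


# Toy data (hypothetical): (model score, measured protein z-score in carriers).
records = [
    (0.8, 2.5),    # predicted up, measured up
    (-0.7, -3.1),  # predicted down, measured down
    (0.6, -0.4),   # predicted up, measured down
    (0.1, 1.9),    # below the score cutoff, ignored
    (-0.9, -2.2),  # predicted down, measured down
]

concordance = sign_concordance(records)
```

Here three of the four confidently scored variants agree in direction, so the concordance is 0.75. A real evaluation would also need to account for sample size, measurement noise, and variants with no detectable effect.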

To make the tool more clinically relevant, Farh and his team tested it on datasets that included paired information on gene expression levels and clinical outcomes. They observed that it could identify promoter variants that affect common biomarkers, like liver enzymes and cholesterol levels.

One of the main goals for Farh and his team was to better understand the causes of rare genetic diseases. “Right now, for rare disease, we're diagnosing about one third of cases looking at protein coding sequence alone. So, there must be an explanation for the cases that we're missing,” he said. “We suspect many of them can be explained by non-coding mutations.” When the team applied PromoterAI to a cohort of patients affected by rare diseases or cancer, they observed that promoter variants accounted for six percent of the burden. When combined with splice variants, that share rose to 20 percent.

Farh is already investigating other non-coding elements, such as enhancers and microRNA target sites, and their contributions to human health. “These are also very well conserved, strongly indicating that mutations in these regions can definitely lead to penetrant rare disease or cancer,” he said.

According to Theis, this is a jumping-off point for understanding the language of DNA. “Why just look at promoter variants?” he said. “People can now easily access these models and start concatenating them to ask more complex questions.”

  1. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237-1251.
  2. Jaganathan K, et al. Predicting expression-altering promoter mutations with deep learning. Science. 2025:eads7373.
  3. Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.
  4. Gao H, et al. The landscape of tolerated genetic variation in humans and primates. Science. 2023;380(6648):eabn8197.

Meet the Author

  • Sahana Sitaraman, PhD

    Sahana is a science journalist based in Lausanne, Switzerland. She holds a bachelor's degree in microbiology from the University of Delhi, India and a master's and PhD in life sciences from the National Centre for Biological Sciences in Bangalore, India. Sahana enjoys writing about health and neuroscience, mental health and women in STEM. She also dabbles in illustrating findings that tickle her brain.
