Each cell in the body is like a minuscule bowl of genetic soup, holding RNA from thousands of genes. But unlike an actual bowl of soup, which can forgive a little too much or too little of certain ingredients, deviations from normal levels of gene expression can have deleterious effects on cells. These gene expression changes can lead to diseases such as cancer, diabetes, autoimmunity, and cardiovascular disorders.
Various DNA regulatory elements work together to ensure gene expression matches the cell’s needs, key among which is the gene promoter. Scientists have previously shown that promoters can drastically alter gene expression, suggesting that mutations in these non-coding regions of DNA could be the underlying causes of rare genetic disorders and cancer.1 However, identifying such pathogenic promoter variants is difficult.

Kyle Farh, a human geneticist and Vice President at Illumina's artificial intelligence lab, develops artificial intelligence tools to study the human genome.
“Evolution doesn't allow them to stick around. Since they're so rare, a clinical lab will very often never see that variant again,” said Kyle Farh, a human geneticist and vice president at Illumina's artificial intelligence lab. “Under these circumstances, it's really difficult to take a non-coding variant and claim that it could explain the patient's disease.”
The tools needed to pinpoint the problematic variants have been lacking. To fill this gap, Farh and his team developed PromoterAI, a deep neural network that accurately identifies promoter variants that dysregulate gene expression.2 Using the model, described in Science, the team estimates that promoter mutations account for six percent of the disease burden in a cohort of patients with rare diseases or cancer.
“The link to rare disease genetics is very impressive and convincing,” said Fabian Theis, a computational biologist at the Technical University of Munich who was not involved in the study. “That’s where a lot of these mutations often have strong impacts.”
Farh has been passionate about studying the non-coding genome for many years and has developed artificial intelligence tools to understand the consequences of splice and missense variants.3,4 Inspired by the performance of these models, he decided to build PromoterAI. Feeding the DNA sequences around promoters into the model, Farh and his team first trained the network to predict how a variant would affect DNA accessibility, histone modifications, and enzyme interactions with DNA, all of which contribute to gene expression. To make the model more robust, the researchers introduced data from the Genotype-Tissue Expression (GTEx) v8 cohort, a repository that contains data on rare non-coding promoter variants that drive gene expression to an extreme. The database provided paired whole-genome sequences and RNA levels from more than 800 individuals and thus helped the model learn to predict whether a promoter variant would change gene expression.
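As a rough illustration of what it can mean for a variant to "drive gene expression to an extreme," the sketch below flags expression outliers for a single gene using a simple z-score and then checks which outliers carry a rare promoter variant. The function name, the threshold, and the toy data are all assumptions made for illustration; the article does not describe how the study defined outliers.

```python
from statistics import mean, stdev

def expression_outliers(levels, threshold=2.0):
    """Return {sample_id: z_score} for samples whose expression of one gene
    lies more than `threshold` standard deviations from the cohort mean.
    The threshold is an illustrative assumption, not a value from the study."""
    values = list(levels.values())
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return {}  # no variation, so nothing can be an outlier
    z_scores = {sample: (value - mu) / sigma for sample, value in levels.items()}
    return {sample: z for sample, z in z_scores.items() if abs(z) > threshold}

# Toy RNA levels for one gene across eight hypothetical individuals.
levels = {
    "s1": 10.0, "s2": 11.0, "s3": 9.5, "s4": 10.5,
    "s5": 10.2, "s6": 9.8, "s7": 10.1, "s8": 30.0,
}
# Hypothetical set of individuals carrying a rare promoter variant in that gene.
carriers = {"s8"}

outliers = expression_outliers(levels)
# Carriers whose expression is unusually high relative to the cohort.
overexpressed_carriers = sorted(
    sample for sample, z in outliers.items() if sample in carriers and z > 0
)
print(overexpressed_carriers)
```

Pairs like these, a rare promoter variant alongside an extreme expression value in the same individual, are the kind of signal that paired sequence and RNA data can supply to a model.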
Next, Farh wanted to test how well PromoterAI performed with cohorts it had never encountered. He introduced information from the UK Biobank, which consists of whole genome sequences and blood plasma protein levels from more than 50,000 participants. The model accurately picked out promoter variants from the genome that drastically altered protein expression. When tested on a cohort with rare variants, PromoterAI’s predictions held up well.
To make the tool more clinically useful, Farh and his team tested it on datasets that included paired information on gene expression levels and clinical outcomes. They observed that it could identify promoter variants that affect common biomarkers, such as liver enzymes and cholesterol levels.
One of Farh and his team’s main goals was to understand the causes of rare genetic diseases a little better. “Right now, for rare disease, we're diagnosing about one third of cases looking at protein coding sequence alone. So, there must be an explanation for the cases that we're missing,” he said. “We suspect many of them can be explained by non-coding mutations.” When the team applied PromoterAI to a cohort of patients affected by rare diseases or cancer, they observed that promoter variants accounted for six percent of the burden. When combined with splice variants, the prevalence shot up to 20 percent.
Farh is already investigating other non-coding elements, such as enhancers and microRNA target sites, and their contributions to human health. “These are also very well conserved, strongly indicating that mutations in these regions can definitely lead to penetrative rare disease or cancer,” he said.
According to Theis, this is a jumping-off point for understanding the language of DNA. “Why just look at promoter variants?” he said. “People can now easily access these models and start concatenating them to ask more complex questions.”
1. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237-1251.
2. Jaganathan K, et al. Predicting expression-altering promoter mutations with deep learning. Science. 2025:eads7373.
3. Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.
4. Gao H, et al. The landscape of tolerated genetic variation in humans and primates. Science. 2023;380(6648):eabn8197.