Compression Technique Shrinks the Size of Massive Pangenome Data Stores

Pangenomics studies generate huge amounts of data. A new data compression method could make pangenomic analysis accessible to more researchers.

Written by RJ Mackenzie

The recent revolution in genomic sequencing has opened new fields of study, but it has also produced a deluge of DNA data that is almost too large for researchers to handle.

In a new study published in Nature Genetics, researchers describe a method that could help the largest genomics projects handle these vast data volumes by achieving unrivaled levels of compression [1]. The new approach could make these data resources accessible to, and usable by, a wider group of scientists.

Pangenomics: DNA Analysis at the Largest Scale

While early genomics projects focused on representative reference genomes derived from a single individual, the emerging field of pangenomics has set bigger goals. In pangenomics studies, researchers assemble many genomes from a single species to capture all of the genetic variation present in that species’ DNA. This approach can demonstrate how mutations affect pathogen spread or drug resistance.

While this approach may offer a broader lens for research, it also puts significant strain on labs’ data storage. Consortia storing pangenomes are amassing terabytes of uncompressed FASTA files (text-based files of nucleotide sequences), and the data handling required to make these files accessible can take an impractically long time [2].

These pangenomic data resources are also challenging to visualize. Graph-based data formats have become popular in the field, but they still have high storage requirements and leave out relevant information, such as the collected genomes’ shared mutational and evolutionary histories.

“The data structures used for pangenomics research are critical because they determine not only how efficiently genetic data is represented, but also what the data can represent,” said Sumit Walia, a study coauthor and electrical engineer at the University of California, San Diego (UCSD), in a statement.

Walia and a team led by UCSD engineer Yatish Turakhia have developed a new file format and data structure, called Pangenome Mutation-Annotated Network (PanMAN), that could maximize the potential of pangenomic data.

Impressing by Compressing

In their new study, the team tested PanMAN’s ability to compress genomic data from the SARS-CoV-2 virus. They first assembled a massive viral pangenome made up of eight million separate viral genomes. They were able to compress it 3,000-fold, reducing this trove of genetic data to 366 megabytes, about half the file size of a mid-definition TV episode.
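To put those figures in perspective, here is a rough back-of-envelope calculation using only the numbers quoted above. It assumes that "3,000-fold" means the ratio of the original size to the compressed size; the article does not say which baseline format the ratio was measured against, so treat the result as an order-of-magnitude estimate rather than a figure from the study.

```python
# Back-of-envelope check using only the figures quoted in the article.
compressed_mb = 366          # compressed PanMAN size, in megabytes
compression_ratio = 3_000    # "3,000-fold" compression (baseline is an assumption)
num_genomes = 8_000_000      # genomes in the SARS-CoV-2 pangenome

# Implied size before compression (decimal units: 1 TB = 1,000,000 MB).
implied_uncompressed_mb = compressed_mb * compression_ratio
implied_uncompressed_tb = implied_uncompressed_mb / 1_000_000

# Average compressed footprint per genome, in bytes (1 MB = 1,000,000 bytes).
bytes_per_genome = compressed_mb * 1_000_000 / num_genomes

print(f"Implied uncompressed size: {implied_uncompressed_tb:.2f} TB")
print(f"Average compressed cost per genome: {bytes_per_genome:.2f} bytes")
```

Under that assumption, the uncompressed data would be on the order of a terabyte, and each genome would add only a few dozen bytes on average to the compressed file.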

The format also allows researchers to analyze the compressed data directly, opening up for study data volumes that were previously too large to work with. “Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis,” said Turakhia in a statement.

The PanMAN format represents individual genomes as the tips of trees, with the branches annotated with genomic features such as mutations. Complex mutations involving multiple parent sequences are represented as edges connecting these trees. As a result, a mutation shared by many genomes is stored only once, on the branch they have in common, rather than separately for each genome. The format also stores, explicitly or implicitly, useful information that other graph representations miss, such as ancestral sequences and phylogeny.
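The idea of storing shared mutations once can be illustrated with a toy mutation-annotated tree. This is a minimal sketch, not the PanMAN file format or its software: the `Node` class, the `reconstruct` and `count_mutations` helpers, and the example sequences are all hypothetical. It assumes that sampled genomes sit at the leaves, that only the root sequence is stored in full, and that mutations are simple substitutions. The connecting edges between trees for complex mutations are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Mutation = Tuple[int, str]  # (0-based position, new base); substitutions only


@dataclass
class Node:
    """A tree node that stores only the mutations on the branch leading to it."""
    name: str
    mutations: List[Mutation] = field(default_factory=list)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str, mutations: List[Mutation]) -> "Node":
        child = Node(name, mutations, parent=self)
        self.children.append(child)
        return child


def reconstruct(node: Node, root_sequence: str) -> str:
    """Rebuild a node's sequence by replaying mutations from the root down."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    seq = list(root_sequence)
    for step in reversed(path):  # root first, then each descendant in turn
        for pos, base in step.mutations:
            seq[pos] = base
    return "".join(seq)


def count_mutations(node: Node) -> int:
    """Count every mutation stored anywhere in the subtree below `node`."""
    return len(node.mutations) + sum(count_mutations(c) for c in node.children)


# Hypothetical example: three genomes, two of which share one mutation.
root_sequence = "ACGTACGTAC"
root = Node("root")
ancestor = root.add_child("ancestor_1", [(2, "T")])  # shared mutation, stored once
g1 = ancestor.add_child("genome_1", [(5, "A")])
g2 = ancestor.add_child("genome_2", [])
g3 = root.add_child("genome_3", [(9, "G")])

genomes = {n.name: reconstruct(n, root_sequence) for n in (g1, g2, g3)}
stored_mutations = count_mutations(root)

for name, seq in genomes.items():
    print(name, seq)
print("mutations stored:", stored_mutations)
```

In this toy case, one root sequence plus a handful of branch annotations is enough to recover all three genomes, and calling `reconstruct(ancestor, root_sequence)` recovers the inferred ancestral sequence as well.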

The team’s next step is to apply PanMAN to human genomes to broaden the technique’s impact.

“Extending compressive pangenomics to human genomes can fundamentally transform how we store, analyze, and share large-scale human genetic data,” said Turakhia. “Besides enabling studies of human genetic diversity, disease, and evolution at unprecedented scale and speed, it can depict detailed evolutionary and mutational histories which shape diverse human populations, something that current representations do not capture.”

  1. Walia S, et al. Compressive pangenomics using mutation-annotated networks. Nat Genet. 2026:1-9.
  2. Deorowicz S, et al. AGC: Compact representation of assembled genomes with fast queries and updates. Bioinformatics. 2023;39(3):btad097.

Meet the Author

  • RJ Mackenzie

    RJ is a freelance science writer based in Glasgow. He covers biological and biomedical science, with a focus on the complexities and curiosities of the brain and emerging AI technologies. RJ was a science writer at Technology Networks for six years, where he also worked on the site’s SEO and editorial AI strategies. He created the site’s podcast, Opinionated Science, in 2020. RJ has a Master’s degree in Clinical Neurosciences from the University of Cambridge.
