In 1995, researchers published the world’s first complete genome sequence of a free-living organism, that of the bacterium Haemophilus influenzae. The sequences of many model organisms, including Escherichia coli and Arabidopsis thaliana, followed shortly thereafter, and by 2001, the Human Genome Project had completed the first draft of a person’s entire DNA sequence. With this new capacity to decipher the genetic blueprint for any organism, some researchers believed they held the key to explore the inner workings of every species on Earth—guided by just a single reference genome per species.

“It was not really in the accepted thought that you’d have to [sequence] more than one of anything,” says Hervé Tettelin, an associate professor of microbiology and immunology at the University of Maryland. “The human genome had been done, a bunch of reference genomes had been done, and that was it. [Researchers]...

But with genome sequencing becoming faster, cheaper, and more accessible all the time, it quickly became apparent that this attitude overlooked one very important aspect of biology—genomic variation among individuals. And for some species, such variation can be significant. In the early 2000s, for example, when Tettelin, then at the Institute for Genomic Research (TIGR), and his colleagues compared eight isolates of Streptococcus agalactiae (or group B Streptococcus, GBS), they found not only the small, within-gene variations predicted by conventional genetics, but an average of 33 completely new genes with every new genome sequenced. “It was a shock,” says Tettelin. “We saw there were many regions—relatively large regions—of diversity.”


Analyses of S. pyogenes (a group A Streptococcus) revealed a similar story: each new strain added, on average, 27 new genes. Tettelin and his colleagues realized that “we were far from having enough genomes to characterize all the genes [for a given species].” This concerning realization prompted the researchers to propose the concept of the “pangenome,” defined as the entire set of genes possessed by all members of a particular species. “We were trying to find a way to represent this diversity,” Tettelin says.

Publishing on the idea for the first time in 2005, the TIGR team focused on the pangenome’s utility in describing the genomic content of microbial species.1 But geneticists studying the sequences of plants, fungi, and animals soon began to face the same problem with their reference genomes: members of a species do not always share the same genes.

Over the past 10 years, large-scale sequencing projects have revealed startling levels of individual genomic variation across the tree of life, challenging the value of the modern reference genome—as well as the very notion of a species. To better capture the genetic makeup of any given taxon, many researchers now argue that the field of genomics should adopt a pangenomic framework in which diversity is central, rather than incidental, to our view of species.

“It’s exciting,” Tettelin says. “People are realizing this [diversity] is there. I think the approach is going to be used more and more.”

Getting a grip on bacterial diversity

PARTITIONING THE GENOME: From the sequence of a single genome, it’s impossible to determine which genes are shared by all members of a species and which are possessed by only some. However, just one additional sequence offers the opportunity to distinguish shared and variable content. As more genomes are sequenced, more genes are discovered, and some genes that were believed to be ubiquitous are found to be lacking from certain individuals. As a result, the estimated size of a species’s core genome (the set of genes shared by all members of a species) generally decreases, and the size of the pangenome (the set of all distinct genes in the species) increases.
By 2005, public databases stored the complete genome sequences for around 250 bacterial species, with “species” broadly defined as a group of organisms that share more than 97 percent sequence identity in their slowly evolving genes for 16S rRNA. But more than 80 percent of these sequences were assembled from a single bacterial isolate—an approach that the TIGR team’s data suggested fell far short of capturing all the genes found in a species. So in their paper, now cited more than 1,000 times, Tettelin and his colleagues laid out the founding principles for a new way of thinking.

Instead of a single reference genome, the authors argued for a description of bacterial species using the set of all genes identified as belonging to members of that species (or “operational taxonomic unit,” as bacteria grouped by close similarity in their 16S rRNA sequences are commonly known). Those genes could be subdivided into core genes, present in every strain sequenced; variable, or “dispensable,” genes that can be found only in some strains; and unique genes, restricted to just one strain.
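The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the TIGR team's pipeline: it assumes genes have already been grouped into families and that each strain is represented simply as a set of gene-family names. The function name `partition_genes`, the strain names, and the gene names are all hypothetical.

```python
from collections import Counter


def partition_genes(strains):
    """Split gene families into core, dispensable, and unique sets.

    `strains` maps a strain name to the set of gene-family IDs it carries
    (an assumed input format; real analyses must first cluster genes).
    """
    n = len(strains)
    # Count how many strains carry each gene family.
    counts = Counter(gene for genes in strains.values() for gene in genes)
    core = {g for g, c in counts.items() if c == n}
    # With a single genome, core and unique cannot be told apart,
    # so nothing is labeled unique until there are at least two strains.
    unique = {g for g, c in counts.items() if c == 1 and n > 1}
    dispensable = set(counts) - core - unique
    return core, dispensable, unique


# Made-up toy data for illustration only.
strains = {
    "strain_A": {"gyrA", "recA", "capsule1", "phageX"},
    "strain_B": {"gyrA", "recA", "capsule1"},
    "strain_C": {"gyrA", "recA", "pilusZ"},
}
core, dispensable, unique = partition_genes(strains)
```

In this toy example, the two genes carried by every strain land in the core set, the gene carried by two of three strains is dispensable, and the two genes seen in only one strain each are unique.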


The team also presented a method for estimating the total size of a taxon’s pangenome (sometimes referred to as the supragenome) based on the number of new genes discovered in each complete sequence. “We came up with the principle of doing all combinations of adding every genome to another,” says Tettelin. If the average number of new genes with each new genome shows no sign of plateauing, the pangenome is theoretically infinite, and said to be open; if the average number of new genes is asymptotic, the pangenome has a more predictable size, and is considered closed. (See illustration above.)
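The idea of averaging over every way of adding one genome to another can be sketched as follows. This is a simplified illustration under stated assumptions, not the published procedure: it treats each genome as a set of gene-family names, tries every possible order of adding genomes (feasible only for a handful of genomes, since the number of orders grows factorially), and averages the counts at each step. The function name `accumulation_curves` and the toy data are hypothetical.

```python
from itertools import permutations
from statistics import mean


def accumulation_curves(genomes):
    """Average new-gene, pangenome, and core counts over all addition orders.

    `genomes` is a list of sets of gene-family IDs. Each returned list is
    indexed by (number of genomes added so far - 1).
    """
    n = len(genomes)
    new = [[] for _ in range(n)]
    pan = [[] for _ in range(n)]
    core = [[] for _ in range(n)]
    for order in permutations(genomes):
        seen = set()            # every gene found so far (running pangenome)
        shared = set(order[0])  # genes present in every genome so far (running core)
        for i, genome in enumerate(order):
            new[i].append(len(genome - seen))
            seen |= genome
            shared &= genome
            pan[i].append(len(seen))
            core[i].append(len(shared))
    return ([mean(x) for x in new],
            [mean(x) for x in pan],
            [mean(x) for x in core])


# Made-up toy data for illustration only.
genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]
new_genes, pan_size, core_size = accumulation_curves(genomes)
```

If the averaged `new_genes` values keep hovering well above zero as genomes are added, that is the pattern the article calls an open pangenome; values that decay toward zero suggest a closed one. Deciding between the two on real data requires fitting a curve to these averages, which this sketch does not attempt.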

“It was pioneering work,” says Chitra Dutta of the Indian Institute of Chemical Biology, whose group recently developed software to perform pangenomic analyses on microbial sequence data.2 “They showed how to quantify the genomic diversity of the species, and provided a framework for predicting how many additional whole genomes might be needed to fully characterize the species. It was a giant leap forward.”

The pangenome concept has since been widely adopted by research groups trying to keep track of newly discovered diversity within and among bacterial taxa. In addition to high recombination rates and mobile genetic elements, which have long been known to be drivers of prokaryotic diversity, horizontal gene transfer—direct or indirect exchange of genetic material among even unrelated organisms—is proving to contribute to individual diversity across the bacterial domain and beyond. (See “Bacteria and Humans Have Been Swapping DNA for Millennia,” The Scientist, October 2016.)

Bacteria have “access to a toolkit within species and across species,” Tettelin explains. For taxa with large, open pangenomes—often a reflection of high rates of gene transfer—that toolkit corresponds to a theoretically unlimited pool of easy-to-access new biological functions. “They’re essentially champions at versatility and adaptability,” he says.

MIT microbiologist Sallie Chisholm has spent decades studying this genomic fluidity in Prochlorococcus, a marine cyanobacterium with a global population of around 3 × 10^27. Each strain contains a modest 2,000 genes, but 10 years ago, Chisholm and her colleagues began discovering up to a couple hundred previously unknown genes every time they sequenced a new strain. The 2005 paper introducing the pangenome concept “captured my imagination as a new way to order all of that complex information, and to get an idea of where we were headed if we kept sequencing genomes,” she says.

In 2007, Chisholm and her colleagues used the TIGR team’s method to determine an open pangenome for Prochlorococcus, with a size of nearly 6,000 total genes from an initial 12 genome sequences.3 Eight years later, with 45 strains sequenced, they revised that estimate up to at least 80,000 genes (around four times the number of genes in the human genome), with the core genome shared by all strains comprising only about 1,000 genes, or less than 2 percent of the total gene pool. “That’s a lot of information shaping that collective,” says Chisholm. “[The pangenome view] changes the way you think about what an organism is.”

Understanding genomic versatility is particularly relevant to the study of disease-causing bacteria, which frequently have large numbers of variable genes. For any particular pathogen, “if the focus is to get a vaccine, we need to know all of the genes that this thing has access to, and all the genes that it expresses into proteins and, more importantly, surface-expressed proteins,” explains Tettelin, whose early work in reverse vaccinology—the identification of candidate vaccines using a pathogen’s genome rather than purely immunological or biochemical methods—depended on such information. “If a genome is representative of what’s in the species, then when you have one, it’s game over. But given the diversity we saw, we sort of knew that one genome, and maybe even a few genomes, was not going to be enough for us to find a new cocktail of vaccines that we could take to market.”

Pangenomic analyses in the last decade have allowed researchers to begin to develop universal vaccines that could provide protection against all strains in a species, or even against several related species. In 2005, Tettelin and his colleagues’ work on GBS led to the identification of a potentially universal vaccine based on a combination of four bacterial surface proteins.4 And last June, researchers at the University of California, San Diego, published a pangenome study of the hospital superbug methicillin-resistant Staphylococcus aureus (MRSA), based on 64 strains, as a starting point for developing a widely effective MRSA vaccine.5

To date, researchers have applied the pangenome framework to some 50 bacterial species, including model organisms, such as E. coli, and commercially significant microbes, such as the wine bacterium Oenococcus oeni. Scientists have used the same basic principles to consider shared and variable genes and gene families within larger groups such as genera, the human microbiome, and even the entire bacterial domain. Now, with the approach broadly accepted as a useful way to organize bacterial diversity, efforts are focusing on incorporating this more variation-centric view into metagenomics, phylogenetics, and even taxonomy itself. (See “Blurred Lines” below.)

“We have learned a tremendous amount about the machinery of life from model organisms that have been studied in the lab, but we haven’t really confronted head-on the role of the diversity,” says Chisholm. “Now I think we’re learning that this diversity is part of what life is all about.”

Not just for bugs

As the pangenome concept continues to percolate through the microbiology research community, newly discovered intraspecific variation is beginning to influence genomic descriptions of other taxa. Although eukaryotic species (often loosely defined on the basis of evolutionary or reproductive isolation) are thought to experience far less horizontal gene transfer and genome rearrangement than their promiscuous prokaryotic cousins, sequencing of multiple individuals per species is revealing extensive genomic diversity that goes far beyond small, within-gene differences.

California State University San Marcos researcher Betsy Read realized the importance of such interindividual variation a few years ago, shortly after finishing the construction of a reference genome for the tiny but ubiquitous eukaryotic phytoplankton Emiliania huxleyi. From the Arctic to the tropics, “almost every bucket of water you pull from the ocean is going to contain E. huxleyi,” says Read. Suspecting that the organism’s ability to adapt to such varied conditions might depend on single-nucleotide polymorphisms within genes, the team got to work sequencing more isolates.

Thirteen sequences later, Read and her colleagues were surprised to find that genome size, originally estimated at some 30,000 genes, varied remarkably between strains, with some strains apparently missing more than 2,000 genes. When they conducted a pangenome analysis, the researchers found that just two-thirds of the genes they had identified were shared by all sequenced isolates.6 In particular, Read notes, there was a high degree of variability in genes encoding metal-binding proteins, key components in E. huxleyi’s adaptation to the environment.

Given the lack of evidence for horizontal gene transfer in E. huxleyi, the availability of the total gene pool to each individual is unlikely to mirror that of prokaryotes. But Read’s team believes that the large pangenome relative to an individual’s core genome underpins this single-celled eukaryote’s adaptability. “E. huxleyi’s got this huge plasticity in its ecophysiology,” she explains. “We think that this genomic variability helps to explain that.”


E. huxleyi is hardly alone in harboring such diversity in its DNA. Larger-scale sequencing projects covering thousands of whole genomes of model eukaryotic organisms such as Saccharomyces cerevisiae and Arabidopsis thaliana have also revealed significant numbers of duplicated or novel genes. And in crop plants, whose genomes frequently contain large duplicated regions, a handful of studies already support links between the presence or absence of “variable” genes and disease resistance, metabolite production, and stress responses. “It’s becoming increasingly acknowledged that gene difference does have an impact,” says David Edwards, a geneticist at the University of Western Australia who studies variation in the mustard family of plants.

One of the first high-resolution plant pangenomes was published in 2014 by the University of Minnesota’s Candice Hirsch and colleagues, who combined the sequences of 503 inbred lines of maize to categorize differences in gene content and other genomic variation.7 “We found that the reference genome assembly contains less than a third of the total genes that exist in maize,” Hirsch says. “That, for our community, was a pretty big breakthrough to grasp how extensive this variation really is.” Separate pangenome studies have established core and variable genes for rice and the wild relative of soybean. And the planned large-scale resequencing of entire germplasm collections by organizations such as the Global Crop Diversity Trust promises huge volumes of data for future analyses.

In theory, the pangenomic approach to plant species could help identify genes involved in adaptation and inform strategies for the introduction or cultivation of environmentally resilient populations. But that sort of progress will require going beyond mere descriptions of variation, notes Hirsch. “In pangenomics—and genomics in general—we need to move into a phase of functionality,” she says. “We need to move beyond the variation at the genome level, and ask how it impacts phenotypic variation.” And, from a cultivation perspective, “can we introduce this variation artificially in an intelligent way?”

An answer to human variation?

VISUALIZING THE PANGENOME: A reference genome built from the DNA of an individual organism can be visualized as a linear sequence. But there is a growing appreciation that this sort of representation fails to reflect the diversity among individuals of a species, which includes not just sequence variation within shared genes, but often different genes altogether. To visualize the genomic content of a species, researchers use interconnected nodes representing all possible combinations of genomic segments or genes found in a species (shown here). Such an approach makes all known sequence information available simultaneously, instead of hiding some away as annotations describing how newly sequenced genomes differ from a linear reference.
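A graph of this kind can be sketched with plain Python dictionaries. This is a deliberately simplified illustration: it uses whole genes as nodes and assumes each genome is given as an ordered list of gene names, whereas real graph-based genome representations typically work with sequence segments. The function name `build_pangenome_graph` and all genome and gene names are hypothetical.

```python
from collections import defaultdict


def build_pangenome_graph(genomes):
    """Build a directed graph from ordered gene lists.

    `genomes` maps a genome name to its genes in chromosomal order
    (an assumed input format). Returns (nodes, edges), each mapping a
    node or a (from, to) edge to the set of genomes that contain it.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for name, order in genomes.items():
        for gene in order:
            nodes[gene].add(name)
        # Connect each gene to its immediate neighbor in this genome.
        for left, right in zip(order, order[1:]):
            edges[(left, right)].add(name)
    return dict(nodes), dict(edges)


# Made-up toy data: one gene swapped in sample_1, one gene missing in sample_2.
genomes = {
    "reference": ["geneA", "geneB", "geneC", "geneD"],
    "sample_1": ["geneA", "geneB", "geneX", "geneD"],
    "sample_2": ["geneA", "geneB", "geneD"],
}
nodes, edges = build_pangenome_graph(genomes)
shared_by_all = [g for g, who in nodes.items() if len(who) == len(genomes)]
```

Each genome is then a path through the graph, and a gene that is absent from the reference (here the hypothetical `geneX`) is simply another node rather than an annotation attached to a linear sequence.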
Of course, one eukaryotic species that has no shortage of genome sequences is Homo sapiens. Between 2008 and 2015, the 1000 Genomes Project generated the world’s largest public catalog of human variation based on data from 2,504 individuals from 26 populations; in 2012, the U.K.’s 100,000 Genomes Project launched an even bigger effort, with an expected completion date of 2017.

The data from these and smaller projects are revealing that Homo sapiens, too, is fairly poorly described by a reference sequence built on just a handful of genomes. Although variation in genomic content between two humans is minuscule in comparison with that of microbes or plants, the earliest attempt at building a human pangenome in 2009—based on the human reference genome and just two others—estimated that up to 40 megabases of sequence, including protein-coding regions, were absent from the reference genome.8 The same year, another team estimated that gene counts varied between any two randomly chosen people by 73 to 87 genes, largely because of variations in copy number.9 And in 2015, a team in Denmark published the first attempt at a “national human pangenome” using sequence data from 10 sets of Danish father-mother-child trios, highlighting hundreds of thousands of new structural variants.10

With differences in gene number increasingly being associated with disorders including autism, Parkinson’s disease, and Alzheimer’s, there are strong medical justifications for taking a more variation-centric view of our species, says Mark Chance, director of the proteomics and bioinformatics center at Case Western Reserve University. He’s part of a team that recently identified more than 300 small sequences absent from the reference genome but present in at least 1 percent of the human population.11 “There’s human and there’s human and there’s human,” he says. “[The genome] does encompass a lot of variation.”

Some researchers now argue for the adoption of a more pangenome-friendly, graph-based representation of the human genome to more accurately represent this variation. (See illustration here.) But the concept (and the terminology) of a pangenome has certainly not yet “caught on as much in the mammalian world” as in microbiology and plant sciences, Chance says, citing the sheer complexity of the task as one potential obstacle. Whether a version of the framework will become popular in the genomics of humans and other animals remains to be seen—and if it does, its form may differ markedly from the original.

But for Tettelin, who has watched the transition from single-genome to multiple-genome descriptions of species across the tree of life, the applicability of the pangenomic idea, first laid out by the TIGR team more than a decade ago, continues to broaden. “It certainly put us on the right track,” he reflects. “Sometimes you embark onto things that lead you into dead ends. But in this case, we essentially landed on a highway instead of a dead end. Now we just have to speed up.” 


Blurred Lines

The classification of species has never been simple. Since the earliest use of the term in a biological context by English naturalist John Ray in the 17th century, the definition of species has been rehashed many times, based variously on criteria ranging from shared physical traits or a capacity to produce viable offspring to a shared niche or evolutionary history. But whichever definition is employed, the boundary between one taxonomic group and the next is not always clear-cut. While a reproductive definition effectively divides most multicellular animals into distinct taxonomic groups, many bacteria, plants, and fungi are much less genetically isolated from one another. Far from offering a neat solution, genome sequencing has revealed the extent of the problem by uncovering dramatic variation within species and surprising overlap between them.

In the face of such complexity, some researchers are developing a more nuanced view. In prokaryotes, where the lines between taxonomic units are fuzziest, pangenome analyses—which partition the genome into core and variable genes depending on their presence or absence among strains or purported species—could offer a more effective way to distinguish closely related organisms than more-traditional approaches. While most current methods compare the sequences of only one or a handful of genes—such as the 16S rRNA gene, or housekeeping genes in the case of multilocus sequence typing (MLST)—to determine relationships between organisms, pangenome analyses compare and contrast whole genomes across multiple individuals, providing an expanded insight into the similarities and differences between organisms.

These methods are already refining biologists’ understanding of bacterial taxonomy. For example, analysis of the core genome in multidrug-resistant Klebsiella pneumoniae revealed that the group comprises two distinct genetic clades (PNAS, 111:4988-93, 2014). And recent analyses of Shigella, one of the leading causes of dysentery, suggest that the bacterium falls into a subgroup of E. coli, rather than forming an independent genus (Front Microbiol, doi:10.3389/fmicb.2015.01573, 2016).

Even in the eukaryotic world, where genomic fluidity is far less pronounced, pangenomic analyses have cast new light on conventional taxonomies. Sequencing of multiple individuals in several groups of eukaryotic organisms, from marine phytoplankton to crop plants, has challenged traditional notions of within-species diversity and raised the question of what “species” even means. If the goal of classification is to create biologically useful groups of like organisms, then such questions are important to resolve. For now, at least, they’re still very much open.


  1. H. Tettelin et al., “Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the bacterial ‘pan-genome,’” PNAS, 102:13950-55, 2005.
  2. N.M. Chaudhari et al., “BPGA—an ultra-fast pan genome analysis pipeline,” Sci Rep, 6:24373, 2016.
  3. G.C. Kettler et al., “Patterns and implications of gene gain and loss in the evolution of Prochlorococcus,” PLOS Genet, doi:10.1371/journal.pgen.0030231, 2007.
  4. D. Maione et al., “Identification of a universal group B Streptococcus vaccine by multiple genome screen,” Science, 309:148-50, 2005.
  5. E. Bosi et al., “Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity,” PNAS, doi:10.1073/pnas.1523199113, 2016.
  6. B.A. Read et al., “Pan genome of the phytoplankton Emiliania underpins its global distribution,” Nature, 499:209-13, 2013.
  7. C.N. Hirsch et al., “Insights into the maize pan-genome and pan-transcriptome,” The Plant Cell, doi.org/10.1105/tpc.113.119982, 2014.
  8. R. Li et al., “Building the sequence map of the human pan-genome,” Nature Biotechnol, 28:57-63, 2010.
  9. C. Alkan et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genet, 41:1061-67, 2009.
  10. S. Besenbacher et al., “Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios,” Nature Commun, 6:5969, 2015.
  11. Y. Liu et al., “Discovery of common sequences absent in the human reference genome using pooled sequences from next-generation sequencing,” BMC Genomics, 15:685, 2014.
