Getting to Know the Genome

A massive project involving hundreds of scientists suggests that very little—if any—of the human genome is truly non-functional.

Sep 5, 2012
Ed Yong

In 2001, the Human Genome Project produced a near-complete readout of the human species’ DNA. But researchers had little idea about how those As, Gs, Cs, and Ts were used, controlled, or organized, much less how they code for a living, breathing human.

That knowledge gap has just got a little smaller. A massive international project called ENCODE, the Encyclopedia of DNA Elements, has cataloged every nucleotide within the genome that does something. That turns out to be significantly more than the 1.5 percent of the genome that contains actual instructions for making proteins. The research, a 10-year effort by an international team of 442 scientists, shows that the rest of the genome, the non-coding majority, is still rife with “functional elements.”

“The genome is no longer an empty vastness,” said Shyam Prabhakar from the Genome Institute of Singapore, who was not involved in the study. “It is densely packed with peaks and wiggles of biochemical activity.”

“Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more,” added Tom Gingeras, one of the studies’ many senior scientists. The results are published today (September 5), in more than 30 papers across many different journals.

Researchers have long recognized that some non-coding DNA probably has a function, and many solid examples have recently come to light. At the same time, people did believe that many of these sequences were, indeed, junk. The ENCODE project suggests otherwise.

The researchers found that many non-coding parts of the human genome contain docking sites where proteins can bind, affecting the expression of both nearby and distant genes. Other non-coding regions are transcribed into RNA molecules that are never translated into proteins. Still others affect how the DNA is folded and packaged. In sum, these regions are not just junk; according to ENCODE’s analysis, 80 percent of the genome has some biochemical function.

The remaining 20 percent may not be junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator. He explains that while ENCODE looked at 147 different types of cells, there are a couple of thousand in total. If other cell types are examined, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” Birney said. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”

The implications are vast, from redefining what a “gene” is to providing new clues in the quest to understand diseases and how the genome works in three dimensions. “There are nuggets for everyone here,” Prabhakar said. “No matter which piece of the genome we happen to be studying in any particular project, we will benefit from looking up the corresponding ENCODE tracks.”

Of course, there’s still a long way to go, Birney noted. “I think it’s going to take this century to fill in all the details,” he said. “That full reconciliation is going to be this century’s science.”

By the numbers

Researchers already knew that 1.5 percent of the genome codes for proteins. ENCODE found that an additional 8.5 percent codes for regions where proteins stick to DNA, presumably regulating gene transcription. And, because ENCODE hasn’t looked at every possible type of cell or every possible protein that sticks to DNA, this figure is likely conservative. Birney estimates that the total proportion of the genome that either creates a protein or sticks to one is around 20 percent.
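To make the arithmetic in this paragraph concrete, here is a minimal sketch that adds up the quoted percentages and converts them to rough base-pair counts. The percentages come from the article; the genome length of 3 billion base pairs is a round-number assumption added for illustration, and the names `ASSUMED_GENOME_BP` and `pct_to_bp` are hypothetical.

```python
# Percentages quoted in the article.
CODING_PCT = 1.5            # codes for proteins
PROTEIN_BINDING_PCT = 8.5   # regions where proteins stick to DNA
BIRNEY_ESTIMATE_PCT = 20.0  # Birney's estimate once more cells and proteins are surveyed

# Assumption (not from the article): a round figure for the genome's length.
ASSUMED_GENOME_BP = 3_000_000_000


def pct_to_bp(pct, genome_bp=ASSUMED_GENOME_BP):
    """Convert a percentage of the genome into an approximate base-pair count."""
    return round(genome_bp * pct / 100)


# Share of the genome that ENCODE has directly tied to making or binding proteins.
observed_pct = CODING_PCT + PROTEIN_BINDING_PCT

for label, pct in [
    ("protein-coding", CODING_PCT),
    ("protein-binding", PROTEIN_BINDING_PCT),
    ("observed total", observed_pct),
    ("Birney's estimate", BIRNEY_ESTIMATE_PCT),
]:
    print(f"{label}: {pct} percent, about {pct_to_bp(pct):,} bp")
```

Under these assumptions, the observed total is 10 percent, half of the roughly 20 percent that Birney expects the final figure to reach.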

The rest of the functional elements in the ENCODE analysis cover other classes of sequence that were thought to be essentially functionless, including introns. “The idea that introns are definitely deadweight isn’t true,” said Birney. Even some repetitive sequences—small chunks of DNA that have the ability to copy themselves and are typically viewed as parasites—are likely to be functional, often containing sequences where proteins can bind to influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” Birney said.

Birney expects that many skeptics will argue about the exact proportion—the 80 percent of the genome that ENCODE estimates to be doing something—and about the definition of “functional.” But, he said, “no matter how you cut it, we’ve got to get used to the fact that there’s a lot more going on with the genome than we knew.”

What’s in a gene?

The simplistic view of a gene is that it’s a stretch of DNA that is transcribed to make a protein. But with ENCODE’s data, this definition no longer makes sense. There are a lot of transcripts, probably more than anyone had realized, some of which connect two previously unconnected genes. This means that the boundaries for those genes have to widen, and the gaps between them shrink or disappear.

Gingeras says that this “intergenic” space has shrunk by a factor of four. “A region that was once called Gene X is now melded to Gene Y,” he says. With such blurring boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honor falls to the RNA transcript. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.”

New disease leads

For the last decade, geneticists have run a seemingly endless stream of genome-wide association studies (GWAS), and have thrown up a long list of single nucleotide polymorphisms (SNPs) that correlate with the risk of different conditions. The ENCODE team has mapped all of these GWAS-identified SNPs to their data.

The researchers found that just 12 percent of known SNPs lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within the non-coding but functional regions that ENCODE identified, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease. “It was one of those too good to be true moments,” said Birney. “Literally, I was in the room [when they got the result] and I went: Yes!”
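A comparison like “60 percent more likely” can be read as a ratio of two fractions: the share of disease-associated SNPs that land in functional regions, divided by the share of random SNPs that do. The sketch below illustrates that calculation on toy data. It is not ENCODE’s actual pipeline; the function names are hypothetical, and the coordinates are invented and chosen so that the ratio comes out at 1.6.

```python
from bisect import bisect_right


def fraction_in_regions(positions, regions):
    """Share of positions that fall inside any half-open [start, end) region.

    `regions` must be sorted by start and must not overlap.
    """
    starts = [start for start, _ in regions]
    hits = 0
    for pos in positions:
        # Index of the last region that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            hits += 1
    return hits / len(positions)


def fold_enrichment(test_positions, background_positions, regions):
    """How many times more often test positions hit the regions than background."""
    return (fraction_in_regions(test_positions, regions)
            / fraction_in_regions(background_positions, regions))


# Toy data, invented for illustration only (not real genome coordinates).
functional_regions = [(100, 200), (500, 600)]
disease_snps = [150, 550, 120, 300, 580]                         # 4 of 5 inside
random_snps = [110, 190, 510, 520, 599, 50, 250, 300, 700, 900]  # 5 of 10 inside

ratio = fold_enrichment(disease_snps, random_snps, functional_regions)
print(f"fold enrichment: {ratio:.1f}")  # 0.8 / 0.5 -> 1.6, i.e. 60 percent more likely
```

A real analysis would also have to match the random SNPs to the disease-associated ones on properties such as allele frequency, which this sketch ignores.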

The ENCODE researchers also found new links between disease-associated SNPs and specific DNA elements. For example, they found five SNPs that increase the risk of Crohn’s disease, and that are recognized by a group of transcription factors called GATA2. “That wasn’t something that the Crohn’s disease biologists had on their radar,” Birney said. “Suddenly we’ve made an unbiased association between a disease and a piece of basic biology.”

“We’re now working with lots of different disease biologists looking at their data sets,” he added. “In some sense, ENCODE is working from the genome out, while GWAS studies are working from disease in.” So far, the team has identified 400 such hotspots that are worth looking into.

The 3-D genome

Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. In reality, DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. In this way, distant parts of the genome can actually be physical neighbors, and can affect each other’s activity.

Job Dekker, a bioinformaticist at the University of Massachusetts Medical School, used ENCODE data to map these long-range interactions across just 1 percent of the genome in three different types of cell, and discovered more than 1,000 of them. “I like to say that nothing in the genome makes sense, except in 3D,” said Dekker. The availability of the new ENCODE data is “really a teaser for the future of genome science,” he added.

Sharing the data

The new ENCODE results are vast, reported in 30 central papers in Nature, Genome Biology, and Genome Research, as well as a slew of secondary articles in Science, Cell, and others. And all of the data are freely available to the public.

The pages of printed journals are a poor repository for such a vast trove of data, so the ENCODE team has devised a new publishing model. On the ENCODE portal site, readers can pick one of 13 topics of interest, such as enhancer sequences, and follow them in special “threads” that pull out all the relevant paragraphs from the 30 main papers. “Rather than people having to skim read all 30 papers, and working out which ones they want to read, we pull out that thread for you,” Birney said.

The team has also built what they call a Virtual Machine, a downloadable program that includes all the code that the ENCODE scientists used to analyze their data. Any researcher can download almost-raw data and reproduce any of the analyses in the papers by themselves. It’s the ultimate in transparency.

“With these really intensive science projects, there has to be a huge amount of trust that data analysts have done things correctly,” said Birney. With the virtual machine, “you can absolutely replay, step by step, what we did to get to the figure. I think it should be the standard for the future.”