Just a decade ago, epigenetics researchers used classic biochemistry to reveal key modifications involved in the control of gene expression. These days, discoveries in epigenetics are as likely to be made with a computer as they are to rely on freezers full of cells or stacks of petri dishes. Researchers working to understand the intricacies of methylation marks, histone patterns, and chromosome structure must use computational approaches.
“Given the way the field is moving, if you do high-throughput experiments of any kind, whether they’re genetic or epigenetic, computation is just part of the pipeline,” says William Noble, a professor of genome sciences and of computer science at the University of Washington School of Medicine, and one of the computer-scientists-turned-epigeneticists who are developing new tools—and making new discoveries—to advance epigenomics.
Sometimes bioinformaticians collaborate with biologists, helping create or fine-tune analysis software to deal with unique data sets. Other times, they’re building tools that they think are needed in the field, or putting their programs to the test by parsing data that are publicly available in existing databases. Overall, the aim is the same: easier analysis of the scads of data that can come out of even the simplest epigenetics experiments.
“In epigenetic research, we’re banking a lot on the fact that by doing many samples at once, we can find patterns. But many times, the differences we are looking for are very small,” says Andrea Baccarelli, a professor of environmental epigenetics at Harvard University.
The Scientist surveyed researchers who have developed freely accessible tools for the data-swamped epigeneticist. Here’s what they said about what these tools have to offer.
MINING FOR METHYL CLUSTERS
Tool: A-Clustering
Researcher: Andrea Baccarelli, Professor, Department of Environmental Epigenetics, Harvard School of Public Health
Challenge: Over the past few decades, bioinformaticians and computational researchers have developed a plethora of tools for aligning gene sequences or finding patterns and differences in large genomics data sets. But the programs are mostly designed to compare data from multiple individuals, not to pinpoint differences within a single person’s body.
“There were not tools that interpret based on tissue,” says Baccarelli, “because in genetics you don’t usually care about which tissue a sample came from.”
For epigenetics, though, that can make all the difference—epigenetic marks vary not only between different tissues and organs, but even between individual cells. So when Baccarelli set out to characterize tissue-specific differences in a nationwide study correlating epigenetic marks with where people live, he had to start from scratch.
“The microarray we are using for this interrogates a half million methylation sites,” he says. “That poses a challenge of statistical power.”
Solution: Baccarelli’s group designed a program, A-Clustering, that automatically compares nearby methylation sites in the genome. If two neighboring sites are methylated under the same conditions, they’re grouped into a cluster. Once researchers have identified clusters in which, at any given time, all the sites are either methylated or unmethylated, they can check whether other data sets follow the same patterns. “When you have a new epigenetic signal, it can be very hard to understand whether what you find has biological significance,” he says. Using A-Clustering, however, researchers can focus on clusters of marks, rather than individual ones.
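The grouping idea described above can be illustrated with a short Python sketch. It is not the A-Clustering implementation: it simply merges adjacent sites when they lie close together and their methylation levels rise and fall together across samples. The function names, the correlation threshold (`min_corr`), the distance limit (`max_gap`), and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_adjacent_sites(positions, methylation, min_corr=0.8, max_gap=1000):
    """Group neighboring sites that are close together and correlated.

    positions   -- sorted genomic coordinates, one per site
    methylation -- one list per site, holding that site's level in each sample
    min_corr, max_gap -- illustrative thresholds, not published defaults
    Returns lists of site indices; single-site groups are dropped.
    """
    if not positions:
        return []
    clusters, current = [], [0]
    for i in range(1, len(positions)):
        close = positions[i] - positions[i - 1] <= max_gap
        similar = pearson(methylation[i - 1], methylation[i]) >= min_corr
        if close and similar:
            current.append(i)
        else:
            clusters.append(current)
            current = [i]
    clusters.append(current)
    return [c for c in clusters if len(c) > 1]


# Toy data: five sites measured in four samples (made-up values).
positions = [100, 250, 400, 5000, 5100]
methylation = [
    [0.10, 0.20, 0.80, 0.90],
    [0.15, 0.25, 0.85, 0.95],
    [0.10, 0.30, 0.70, 0.90],
    [0.90, 0.10, 0.50, 0.40],
    [0.20, 0.80, 0.30, 0.60],
]
clusters = cluster_adjacent_sites(positions, methylation)
print(clusters)
```

In this toy data the first three sites form one cluster, while the last two are near each other but vary independently and so stay ungrouped. Testing a handful of clusters instead of every site is what eases the statistical-power problem Baccarelli describes.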
Baccarelli’s team tested the new tool on a set of epigenomics data from 80 men in the U.S. who had either low or high exposure to pesticides. The presence and size of epigenetic clusters, they found, were correlated with pesticide exposure (Bioinformatics, 29:2884-91, 2013). Armed with that initial result, biologists can now investigate what mechanisms link pesticides to epigenomic alterations.
Tool: ChromImpute
Researcher: Jason Ernst, Assistant Professor, Departments of Biological Chemistry and Computer Science, University of California, Los Angeles
Challenge: If you’re setting out to map a region of DNA, it’s helpful to know whether other researchers have previously examined it and what they have found. But sifting through years of literature and trying to pinpoint the methods and results of other labs can be frustrating and time-consuming. Ernst set out to streamline this process.
Solution: Ernst’s team created ChromImpute, a tool that can mine existing data sets to create a compendium of epigenetic marks. Like A-Clustering, it determines whether marks are correlated with each other. The current version of the program is designed to rely on data from a few sources, including the NIH Roadmap Epigenomics and ENCODE (Encyclopedia of DNA Elements) projects, to create the atlas (Nature Biotechnol, 33:364-76, 2015).
“The idea of ChromImpute is that one can predict what some of these epigenetic data sets should look like based on existing data,” explains Ernst. “Then, even once you’ve done an experiment, you get greater statistical robustness from combining your new data with this existing compendium.”
The collections created by ChromImpute can be used in a variety of ways: if you know that two marks are always correlated, you can infer the state of one mark by measuring only the other. Or, you can use ChromImpute’s predictions for quality control. “This could help flag data sets that don’t match up with existing data,” Ernst says.
In addition, if you’re under constraints of budget, time, or sample size, the ChromImpute data can guide your selection of what marks to map in the first place.
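To make the two uses above concrete, here is a minimal Python sketch of predicting one mark from a correlated mark and then flagging measurements that disagree with the prediction. It deliberately swaps in ordinary least-squares regression on a single predictor, which is far simpler than ChromImpute's actual models, and every name and number in it (`fit_line`, `impute`, `flag_outliers`, the tolerance, the signal values) is an assumption chosen for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor mark has no variation in the reference data")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(measured, slope, intercept):
    """Predict the unmeasured mark from the measured one."""
    return [slope * value + intercept for value in measured]


def flag_outliers(observed, predicted, tolerance):
    """Indices where a new measurement departs from the prediction."""
    return [
        i
        for i, (obs, pred) in enumerate(zip(observed, predicted))
        if abs(obs - pred) > tolerance
    ]


# Reference compendium (made-up signal values for two correlated marks).
reference_mark_a = [1.0, 2.0, 3.0, 4.0]
reference_mark_b = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(reference_mark_a, reference_mark_b)

# New sample: only mark A was measured, so mark B is predicted.
predicted_b = impute([5.0, 0.5], slope, intercept)

# Quality control: compare a later measurement of mark B with the prediction.
suspect = flag_outliers([10.2, 3.0], predicted_b, tolerance=0.5)
print(predicted_b, suspect)
```

The same fitted relationship serves both purposes: it fills in a mark that was never measured, and it highlights positions where new data do not match what the existing compendium would lead you to expect.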
LIFE CYCLE EPIGENOMICS
Tool: Segway
Researcher: William Noble, Professor, Departments of Genome Sciences and Computer Science, University of Washington School of Medicine
Challenge: Unlike DNA sequence data, which is a straightforward string of letters representing nucleotides, epigenetic data can take many forms. Information can be derived from studying histone modifications, RNA-seq transcriptomes, chromatin accessibility screens to gauge the three-dimensional structure of a genome, or methyl groups on nucleotides. “If you want to try to make sense of all that at once, you can put it in a genome browser and look at it, but it’s really hard to wrap your head around it by just staring at it on a screen,” says Noble. So a few years ago, Noble and collaborators set out to develop an automated program that could line up all these data and look for trends.
Solution: What they came up with was Segway, a semi-automated pattern-finding tool (Nature Methods, 9:473-76, 2012). It’s semi-automated, Noble explains, because a human still has to interpret many of the labels the program assigns to segments of the genome. But it also takes a machine-learning approach. That means that over time, the program gets better at looking for the kinds of patterns you ask it to find. Segway integrates different pieces of data—such as histone modifications, transcription-factor binding sites, and locations of open chromatin—to find the patterns. These could point toward regions such as gene enhancers or, Noble says, structural entities within DNA that haven’t even been identified and named yet.
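A toy Python sketch can show what it means to turn several data tracks into labeled segments of the genome. This is not Segway's method: the sketch uses a simple nearest-centroid rule followed by merging of consecutive bins, and the state centroids, bin values, and function names are hypothetical. The labels are left anonymous (`state_0`, `state_1`, ...) because, as Noble notes, a person still has to decide what each label means biologically.

```python
def nearest_label(vector, centroids):
    """Return the label whose centroid is closest (squared distance) to vector."""
    return min(
        centroids,
        key=lambda label: sum((v - c) ** 2 for v, c in zip(vector, centroids[label])),
    )


def segment(bins, centroids):
    """Label each bin, then merge runs of identical labels into segments.

    bins      -- one tuple per genomic bin, one value per data track
    centroids -- label -> typical track values for that label (assumed known)
    Returns (start_bin, end_bin_exclusive, label) tuples.
    """
    segments = []
    for i, vector in enumerate(bins):
        label = nearest_label(vector, centroids)
        if segments and segments[-1][2] == label:
            start = segments[-1][0]
            segments[-1] = (start, i + 1, label)  # extend the current segment
        else:
            segments.append((i, i + 1, label))
    return segments


# Each bin holds (histone-mark signal, open-chromatin signal); values are made up.
bins = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1), (0.9, 0.1)]
centroids = {
    "state_0": (1.0, 1.0),
    "state_1": (0.0, 0.0),
    "state_2": (1.0, 0.0),
}
segments = segment(bins, centroids)
print(segments)
```

Here the centroids are supplied by hand; in a learning-based tool the typical signal pattern of each state would instead be estimated from the data.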
“The idea is that if you can recognize, ‘Hey, here are the same patterns across different experiments,’ then there’s a good likelihood that they’re biologically significant patterns,” says Noble.
When Noble’s group used Segway to analyze data collected by researchers from around the globe as part of the ENCODE Consortium, they were able to pick out elements including promoters and enhancers (Nature, 489:57-74, 2012). The annotation will help researchers home in on markers relating to health and disease, Noble says. One research group, for instance, has recently turned to Segway to help identify the difference between genetic variants that cause disease and those that are relatively less pathogenic (Nature Genetics, 46:310-15, 2014). The results let them prioritize which single-nucleotide changes in the genome should be followed up on with more research.
Like many of the methods discussed here, Segway is likely too complex for most biologists to use without input from a computational researcher like Noble. But he says that’s mostly because some of the early steps in data crunching—such as getting all the data into the formats the program can accept—still have to be done manually. That may soon change. “I’m hoping in the next year the field will coalesce on methods for some of the nitty-gritty details, and end users will no longer have to worry about that part,” he says. “That would be a big step for the field.”
BRINGING EPIGENOMICS TO THE CLINIC
Tool: MethMarker
Researcher: Christoph Bock, Principal Investigator, CeMM Research Center for Molecular Medicine, Austrian Academy of Sciences; Adjunct Group Leader, Max Planck Institute for Informatics
Challenge: For clinical scientists who want to introduce epigenetic tests in medicine—to discriminate tumors that might respond to particular treatments, for example—the computation can be especially hard to parse.
The Max Planck Institute for Informatics has churned out half a dozen tools for managing and analyzing epigenetic data. One of its latest—a software program that pinpoints epigenetic alterations that could serve as clinical biomarkers—is aimed at meeting this clinical need.
Solution: Bock helped spearhead the development of MethMarker, a tool with multiple functions. It can inform the design of assays and appropriate primers for experiments including combined bisulfite restriction analysis, bisulfite single-nucleotide primer extension (SNuPE), or methylation-specific PCR. After you have a data set, though, MethMarker becomes even more powerful.
MethMarker allows you to input data from individual patients to identify epigenetic markers that might predict drug response. The program pinpoints which, if any, markers hold the most promise for screening, and which assays to use to distinguish them (bisulfite restriction versus methylation PCR, for instance). The tool, Bock says, can help ease the bottleneck between basic research and clinical application that has limited how epigenetics can be used in medicine.
To show the utility of MethMarker, Bock’s group applied the program to the MGMT gene, thought to be hypermethylated in a number of cancers. Patients whose tumors have the hypermethylated gene, previous studies have found, have lower survival rates, but are also more responsive to certain therapies. Yet no clinically available test can detect the presence of the hypermethylation. MethMarker not only grouped sample tumors into the hypermethylated or nonhypermethylated varieties, but also parsed those groups to find which particular methyl marker was key to distinguishing them (Genome Biol, 10:R105, 2009).
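The last step described above, finding which methylation site best separates two groups of tumors, can be sketched in a few lines of Python. This is only an illustration of the general idea, ranking sites by the gap between group averages, and is not how MethMarker itself scores candidates. The site names (`cpg_1` and so on), the methylation values, and the function names are invented for the example.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def rank_markers(group_a, group_b):
    """Rank sites by the absolute difference in mean methylation between groups.

    group_a, group_b -- site name -> list of methylation fractions, one per tumor
    Returns site names, most discriminating first. Only sites present in both
    groups are scored.
    """
    scores = {
        site: abs(mean(group_a[site]) - mean(group_b[site]))
        for site in group_a
        if site in group_b
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical methylation fractions at three candidate sites.
hypermethylated = {
    "cpg_1": [0.90, 0.80, 0.85],
    "cpg_2": [0.50, 0.60, 0.55],
    "cpg_3": [0.70, 0.90, 0.80],
}
nonhypermethylated = {
    "cpg_1": [0.10, 0.20, 0.15],
    "cpg_2": [0.50, 0.40, 0.45],
    "cpg_3": [0.40, 0.50, 0.45],
}
ranking = rank_markers(hypermethylated, nonhypermethylated)
print(ranking)
```

A real biomarker screen would also weigh measurement noise, sample size, and how well each candidate site can be assayed, which this sketch ignores.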