Abundant Sequence Errors in Public Databases

A new algorithm reveals hoards of preparation-induced DNA mutations in publicly available human sequences.

By | February 16, 2017

 

FLICKR, SAURI NASH

Some sequence variants found in DNA specimens may actually be caused by damage during sample processing, according to a paper in Science today (February 16). A team of researchers at New England Biolabs (NEB) has devised an algorithm for assessing the degree of such damage, and suggests that using DNA repair enzymes during sample preparation might rectify the problem.

“The work demonstrates how to distinguish somatic variants from those due to DNA preparation damage,” Stanford University’s Stephen Montgomery, who was not involved with the work, wrote in an email to The Scientist. “The benefits of this [include] reduced false positives . . . in discovery-based cancer genome projects,” he added.

It is well known that DNA samples extracted from ancient specimens or from formalin-fixed, paraffin-embedded tissues are prone to fragmentation and chemical modification, which can produce mutations that did not exist in the living organism. But recent evidence suggests that, in fact, any DNA sample may be at risk of such artificial mutagenic damage. DNA sonication, the use of sound energy to shear DNA into fragments in preparation for amplification and sequencing, is known to induce oxidative damage that introduces mutations.

Such mutations occur only rarely within a sample and so, in many instances, are not problematic. But in cancer biology, explained molecular oncologist Marc Ladanyi of Memorial Sloan Kettering Cancer Center in New York, who was not involved with the work, “there is an increasing emphasis on [identifying] sub-clonal mutations [as well as] detecting mutations in free tumor DNA in the plasma,” both of which may be present in only a very small proportion of cells in the sample.

“[When dealing with variants at] such low allele frequencies, this artifact is a genuine concern,” Ladanyi said, “and the paper is a good reminder that the artifact needs to be guarded against.”

Laurence Ettwiller and fellow researchers at NEB in Ipswich, Massachusetts, have now devised an algorithm that calculates the extent of such damage in a sequenced DNA sample. The algorithm makes use of the fact that the oxidative damage of DNA during sonication converts guanine to 8-oxoguanine, which appears and acts like a thymine during sequencing reads. Comparing the sequencing reads of the two complementary strands, these converted guanines can be spotted as mismatches: one strand reads out thymine, but the complementary strand reveals a partnered cytosine (which pairs with guanine). Naturally occurring guanine-to-thymine variants, on the other hand, would have thymine’s natural partner adenine. The algorithm thus compares the first and second sequencing reads to reveal the degree of mismatching (or imbalanced) thymines to determine the amount of damage.
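The read-to-read comparison can be illustrated with a minimal Python sketch. This is a simplified illustration under stated assumptions, not the authors' GIV implementation: the function names, the tally format (counts keyed by reference base and observed base, already oriented to the reference), and the use of a plain ratio of read 1 to read 2 variant rates are all assumptions made here for clarity.

```python
from collections import Counter


def variant_rate(tally, ref, alt):
    """Fraction of positions with reference base `ref` that were read as `alt`.

    `tally` maps (reference_base, observed_base) -> count (hypothetical format).
    """
    ref_total = sum(n for (r, _), n in tally.items() if r == ref)
    if ref_total == 0:
        return 0.0
    return tally.get((ref, alt), 0) / ref_total


def imbalance_score(read1_tally, read2_tally, ref="G", alt="T"):
    """Ratio of the ref->alt rate in read 1 to the same rate in read 2.

    Genuine variants should appear at similar rates in both reads (score near 1);
    a large excess in one read suggests damage-induced errors.
    """
    rate1 = variant_rate(read1_tally, ref, alt)
    rate2 = variant_rate(read2_tally, ref, alt)
    if rate2 == 0:
        # No ref->alt calls in read 2: balanced only if read 1 has none either.
        return float("inf") if rate1 > 0 else 1.0
    return rate1 / rate2


# Made-up counts for illustration only.
damaged_r1 = Counter({("G", "G"): 992, ("G", "T"): 32})
damaged_r2 = Counter({("G", "G"): 1016, ("G", "T"): 8})
damaged_score = imbalance_score(damaged_r1, damaged_r2)

# Identical tallies in both reads give a balanced score.
balanced_score = imbalance_score(damaged_r2, damaged_r2)
```

A score near 1 means guanine-to-thymine calls appear equally often in both reads, as expected for genuine variants; a score well above 1 indicates an excess in one read, the signature the researchers attribute to damage.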

When applied to sequences in the 1000 Genomes and The Cancer Genome Atlas databases, the algorithm—called Global Imbalance Value (GIV)—determined that 41 percent of the 1000 Genomes datasets had an imbalance score indicative of damage, while 73 percent of those in The Cancer Genome Atlas showed extensive damage.

“The damage is more prevalent than we would have expected,” said NEB’s Thomas Evans, who co-authored the study. Such errors would be likely to confound the identification of true low-frequency somatic variants, he said.

On a positive note, said Ettwiller, “one thing that people can do is to look at samples that they have and flag ones that are too damaged”—essentially, use the GIV algorithm, which is freely available on GitHub, as a quality control step. The GIV score of a sample could also be used as a guide to set stringent thresholds for identifying potentially genuine low-frequency variants.
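As a rough sketch of that quality control step, the snippet below flags samples whose imbalance score exceeds a cutoff. The sample names, scores, and cutoff are hypothetical values chosen for illustration; the article does not state a threshold, and this is not the interface of the tool on GitHub.

```python
def flag_damaged(scores, cutoff):
    """Return the sorted names of samples whose imbalance score exceeds `cutoff`."""
    return sorted(name for name, score in scores.items() if score > cutoff)


# Hypothetical per-sample scores and cutoff, for illustration only.
example_scores = {"sample_a": 1.02, "sample_b": 2.7, "sample_c": 1.4}
flagged = flag_damaged(example_scores, cutoff=2.0)
```

A stricter cutoff flags more samples; where to set it depends on how low the allele frequencies of interest are.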

In addition, the authors suggest a way to rectify the damage before sequencing takes place. When a mix of DNA repair enzymes was added to the DNA sample during preparation, the oxidation damage was fixed, they reported.

“[The paper] provides a technical solution, which is repairing the DNA with this enzyme cocktail,” said Ladanyi. But, he noted, “the authors are from NEB and the solution to the problem is to use the NEB repair kit, so there’s an intrinsic conflict of interest.”

To that point, Ettwiller said that while the team did use NEB enzymes to fix their own damaged DNA samples, they are not asserting it would work for all DNA preparations.

“We do sell that mix for repairing DNA upstream of sequencing, but we don’t want to make any grandiose claims; that’s not how NEB works,” Evans said. “We’re continuing to evaluate it.”

L. Chen et al., “DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification,” Science, 355: 752-56, 2017.

Comments

spritrig

Posts: 6

February 18, 2017

There are also genotyping errors. 

BParker

Posts: 1

February 23, 2017

Hmmm... The descriptive heading has an interesting inference: "hoards of .. mutations" suggests secret stores of mutations just waiting to be unleashed on an unwary public. Could the author have meant this?

On the other hand, "hordes..of mutations" might suggest large groups of people celebrating their mutations.

 

 
