Shotgun sequencing comes of age

sequence helps settle debate over shotgun versus clone-by-clone sequencing.

December 31, 2002

With little fanfare, the much-debated sequencing method known as whole-genome shotgun (WGS) has become a conventional way to sequence genomes. Two studies out this month help to confirm its importance.

Early this month, the publicly funded mouse genome project showed that WGS could yield a high-quality draft sequence — one superior to the first draft of the human genome. And in a paper published December 23 in Genome Biology (a publication of The Scientist's partner, BioMed Central), Susan Celniker and colleagues report that the WGS method has produced a Drosophila sequence approaching the standard that the US National Human Genome Research Institute (NHGRI) has set for finished sequence — less than 1 error per 10,000 base pairs. This third, and "finished", version of the Drosophila genome — which was the first metazoan genome sequenced predominantly by the WGS method — now averages 1.09 errors per 10,000 base pairs.
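To put those error figures on a common footing, here is a minimal Python sketch of the arithmetic. The two constants come from the figures quoted above; the function names are illustrative, and the Phred-style quality score (Q = -10 log10 of the per-base error probability) is a standard sequencing convention added here for comparison, not something the paper or the NHGRI standard is quoted as using.

```python
import math

# Figures quoted in the article (errors per 10,000 base pairs).
NHGRI_MAX_ERRORS_PER_10KB = 1.0    # finished-sequence standard: fewer than this
DROSOPHILA_ERRORS_PER_10KB = 1.09  # reported average for the third Drosophila release


def per_base_error_rate(errors_per_10kb):
    """Convert errors per 10,000 bp into a per-base error probability."""
    return errors_per_10kb / 10_000


def phred_quality(error_rate):
    """Phred-style quality score for a per-base error probability (assumed convention)."""
    return -10 * math.log10(error_rate)


def errors_per_megabase(errors_per_10kb):
    """Scale errors per 10,000 bp up to errors per 1,000,000 bp."""
    return errors_per_10kb * 100


def meets_standard(errors_per_10kb, limit=NHGRI_MAX_ERRORS_PER_10KB):
    """True when the error rate is strictly below the quoted limit."""
    return errors_per_10kb < limit


if __name__ == "__main__":
    rate = per_base_error_rate(DROSOPHILA_ERRORS_PER_10KB)
    print(f"per-base error rate: {rate:.6f}")
    print(f"Phred-style quality: {phred_quality(rate):.2f}")
    print(f"errors per megabase: {errors_per_megabase(DROSOPHILA_ERRORS_PER_10KB):.0f}")
    print(f"meets quoted standard: {meets_standard(DROSOPHILA_ERRORS_PER_10KB)}")
```

On this scale the quoted standard corresponds to a quality score of 40, and the reported Drosophila figure falls just short of it, which is consistent with the article's wording that the sequence is "approaching" the standard.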

"The study seems to answer one of the initial criticisms of WGS, that the finishing stage would be more difficult. Turns out it is not," S. Blair Hedges, who works on vertebrate genome evolution at Pennsylvania State University, told The Scientist.

WGS has been around for two decades, but became controversial when Celera Genomics announced it would use the method to produce a draft human genome sequence faster than the publicly funded Human Genome Project. The latter relied heavily on a different method, usually known as clone-by-clone.

Eric Lander of the Whitehead Institute, one of the NHGRI-funded sequencing centers, blames the controversy in part on journalists. "The WGS versus clone-based sequence issue was so muddled by the press during the Human Genome Project," he told The Scientist. Journalists presented the debate as an argument over whether WGS would work, although researchers, he said, always agreed that it would work for the draft sequence. The issue was whether it would be the best route for getting to finished sequence. "It was a cost–benefit argument," he said — an argument that has not yet been resolved. "The only way to know is to measure the cost of finishing a mammalian genome both ways. The experiment has been done for human by clone-by-clone, but not for WGS."

Draft sequences are useful for many purposes, but finished sequences are essential for identifying the full set of genes and regulatory regions and getting the correct sequence of proteins, Lander pointed out. "Without this, you can't know what you're missing, what apparent genes may be non-functional pseudogenes. You also cannot study repeat sequences accurately. And it is much harder to spot new mutations."

Finished sequence also permits verification and error correction, and completes fragmented and fragmentary genes, according to Mark Blaxter, of the Institute of Cell, Animal and Population Biology in Edinburgh, UK. The completed Caenorhabditis elegans genome, he told The Scientist, was 100.3 Mb in size and contained more than 1,000 additional protein-coding genes compared with the 97 Mb first draft of 1998.

Lander argues that it is possible to finish a shotgun sequence in organisms with few repeats, like bacteria or even Drosophila, with 3% repeats in euchromatic regions. "But the evidence suggests that clone-by-clone sequencing is required for organisms with major repeats." That is particularly true of the human sequence, in which 50% of the genome is repeats and, more importantly, 5% represents nearly exact duplication, he said.

WGS smashes a genome into millions of bits, sequences the bits, and localizes each one to a specific spot in the genome by matching genetic markers in the bit to the same markers on chromosomes. The clone-by-clone method breaks a genome into largish chunks, clones the chunks into bacterial artificial chromosomes (BACs), breaks the BAC DNA into smaller chunks, matches their end sequences via computer programs, and then localizes them in the genome with markers. It takes longer than WGS and means sequencing thousands of BACs many times to map a genome, but it has been regarded as more accurate. WGS requires millions of sequence reads, too, but is believed to be less expensive.
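The shotgun idea, and the repeat problem raised elsewhere in this article, can be illustrated with a toy Python sketch. It is not the pipeline used by Celera or the Berkeley Drosophila Genome Project: it swaps in a simple greedy suffix-prefix overlap merge in place of the marker matching described above, it cuts reads at a fixed step instead of at random positions, and the sequences, function names and parameters are all hypothetical.

```python
def shotgun_reads(genome, read_len, step):
    """Cut overlapping reads at a fixed step (real shotgun reads start at random positions)."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap until none remains."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy sequence with no repeated stretches: the reads reassemble cleanly.
UNIQUE_GENOME = "ATGGCGTACCTGAAGTCATT"
unique_contigs = greedy_assemble(shotgun_reads(UNIQUE_GENOME, read_len=8, step=4))

# Hypothetical tandem repeat: every read is identical, so the copy number is lost.
TANDEM_GENOME = "ACGT" * 6
tandem_contigs = greedy_assemble(shotgun_reads(TANDEM_GENOME, read_len=8, step=4))

if __name__ == "__main__":
    print(unique_contigs)
    print(tandem_contigs)
```

In the first case the four reads merge back into the original 20-base sequence. In the second, the 24-base tandem repeat collapses into a single 8-base contig, because nothing in the reads records how many copies there were. That is a small-scale version of the difficulty with repeats that the researchers quoted here describe.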

After all the high-profile discord, researchers have come quietly to a consensus on sequencing: they want the best of both approaches. Today's genome projects tend to combine the two into hybrid strategies that are shaped by the complexity of the genome under study and the way researchers are likely to use the sequence information.

"We think shotgun sequencing is enough, certainly for organisms whose sequence will be used primarily for comparative genomics studies," Susan Celniker told The Scientist. Celniker is co-director of the Berkeley Drosophila Genome Project Sequencing Center at the University of California, Berkeley. "BAC end-sequence is a necessary component of the whole-genome shotgun-sequencing strategy to build large sequence scaffolds, and the BAC fingerprints are essential to verify the assembly, but making thousands of BAC subclone libraries is unnecessary." Both zebrafish and rat are being sequenced using hybrid strategies with a whole-genome shotgun component, she pointed out. But, she added, "As the publicly available sequence-assembly algorithms and software improve, I think we will see the end of the hybrid strategies."

"I believe that shotgun sequencing is great, but it doesn't give an accurate account of the 40% or so of the genome of mammals that is repetitive. So, there is definitely a need for the sequencing of BAC clones to finish the job," said Haig Kazazian, who chairs the genetics department at the University of Pennsylvania and studies retrotransposons. "Speciation, aging and other key processes may be affected by repeats," he told The Scientist.

Celniker and colleagues reported that, for some repeats, neither WGS nor clone-by-clone works particularly well. Large tandem repeats such as the histone cluster in Drosophila are not solved by any simple method, and even in smaller tandem arrays the number of copies is being estimated based on sizing fragments using Southern blots, she said.

Financial calculations continue to drive individual decisions about which methods to use for which projects. "Everyone in genomics agrees that more sequence and better sequence is better," Blaxter said. But only a limited number of bases can be sequenced in a year. The result, he said, is a continual tug-of-war between those who would like to see exhaustive completion of one genome and those who would rather have draft sequences of several.

