NCI, Cray Blaze Through Genome Map

Aug 20, 2001
Jeffrey Perkel
Late last year, while watching a news report that examined the challenges of annotating the human genome sequence, Bill Long had an epiphany. A programmer at Seattle-based Cray Inc., Long realized that many bioinformatics complexities are basically problems of pattern matching--a Cray specialty. So, Long phoned the National Cancer Institute's (NCI) Advanced Biomedical Computing Center (ABCC) and suggested a collaboration. He was well received. Says ABCC director Stan Burt, facility researchers "pride [themselves] on being able to translate biological problems into computational [ones]."

The key to Long's idea centered on four pieces of hardware found inside each Cray computer. Distinct from the computer's central processing unit, these pieces, when acting in concert, are extraordinarily efficient at pattern matching. This hardware and its associated instruction set were first developed for the intelligence community. Most other users of these computers neither know nor care that this hardware exists, says Steve Conway, Cray's corporate communications vice president and head of bioinformatics.

The ABCC already owned millions of dollars' worth of high-powered computing equipment, including systems developed by IBM, Compaq, Sun, and Silicon Graphics Inc. These computers aid scientists working on phenomenally complex computational problems, but each has its strengths and weaknesses. None had high-speed pattern-matching capabilities--except for one.

Better Use of Existing Hardware

In 1999, the ABCC acquired a Cray SV1 supercomputer--96 processors on four nodes, 96 gigabytes of memory, and a theoretical peak speed of 115 billion floating-point operations per second (115 gigaFLOPS)--under a three-year, $6.5 million lease. But the ABCC lacked the software necessary to solve intense pattern-matching problems. Long's phone call enabled the center to truly harness the power it already had.

The initial results from this collaboration were announced July 9 and suggested that Cray computers could provide the horsepower necessary for comprehensive analyses of sequence landmarks. The collaborators designed software that could, for the first time, generate an exhaustive map of every short tandem repeat (STR) between two and 16 nucleotides in length in the entire human genome--without prior sequence knowledge. Remarkably, this analysis, which had previously been impossible, took less than 10 minutes to complete, according to ABCC's Burt. Researchers must now determine how broadly this approach applies to other bioinformatics problems.

The impact of this work is a matter of debate; the project is to some extent a "proof-of-concept" test. Competing algorithms, including one from Gary Benson at the Mount Sinai School of Medicine in New York, are also quite fast, though they adopt statistical, rather than exhaustive, approaches. Grant Heffelfinger, deputy director for new initiatives at the Sandia National Laboratory (SNL) Computers, Computation, and Mathematics Center, questions whether blazingly fast pattern-matching algorithms will be necessary in a few years' time, as researchers move from questions of genome annotation toward issues of functional genomics, proteomics, and beyond.

The particular problem being considered at the ABCC was STR mapping. STRs are genomic landmarks that can be used as markers for disease. Polymorphisms in the repeats may even cause certain diseases. According to Benson, a computer scientist in Mount Sinai's department of biomathematical sciences, tandem trinucleotide repeat polymorphisms are associated with rare neurological diseases such as Huntington's disease, Friedreich's ataxia, and fragile X mental retardation. Furthermore, Benson adds, longer repeats (up to 18 nucleotides in length) have been linked to forms of diabetes, epilepsy, and cancer. Because they are polymorphic, tandem repeats are useful in forensic science as well.

Cray's Conway says that typical bioinformatics algorithms approach problems such as STR mapping using statistical sampling rather than exhaustive methods. In a statistical sampling approach, relatively large regions are examined in aggregate to produce a score. If that score exceeds some threshold, the region is examined in greater detail; if not, it is discarded. In contrast, exhaustive approaches examine each residue, one at a time. The disadvantage of statistical approaches is that important elements can be missed if they lie in a region that fails to produce an adequate score. Exhaustive approaches do not suffer from this drawback, but they are usually computationally prohibitive. Nevertheless, according to Conway, initial benchmark comparisons of the SV1 to the ABCC's other supercomputers showed that, when running software written to take advantage of the computer's pattern-matching capabilities, the Cray was up to 2,000 times faster than otherwise comparable computers. Thus, a problem that ordinarily took days could be finished in minutes.
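
The contrast is easy to see in miniature. The following Python sketch is purely illustrative--the window size, threshold, and scoring function are assumptions for demonstration, not details of any production algorithm. The statistical scan scores large windows in aggregate and flags only high scorers; the exhaustive scan checks every residue position.

    def repeat_score(window: str) -> float:
        """Toy score: fraction of positions matching the base two ahead
        (a crude signal for period-2 tandem repeats)."""
        if len(window) < 3:
            return 0.0
        hits = sum(1 for i in range(len(window) - 2)
                   if window[i] == window[i + 2])
        return hits / (len(window) - 2)

    def statistical_scan(seq: str, window: int = 100, threshold: float = 0.6):
        """Score large regions in aggregate; only high-scoring windows are
        flagged for closer inspection. Fast, but a repeat buried in a
        low-scoring window is silently missed."""
        flagged = []
        for start in range(0, len(seq) - window + 1, window):
            if repeat_score(seq[start:start + window]) >= threshold:
                flagged.append((start, start + window))
        return flagged

    def exhaustive_scan(seq: str, unit: int = 2, min_copies: int = 3):
        """Examine every residue: report each position where the next
        `unit` bases repeat at least `min_copies` times in a row."""
        found = []
        for start in range(len(seq) - unit * min_copies + 1):
            u = seq[start:start + unit]
            if all(seq[start + k * unit:start + (k + 1) * unit] == u
                   for k in range(min_copies)):
                found.append((start, u))
        return found

The exhaustive version can never miss a repeat, but it touches every residue--exactly the cost that made such scans prohibitive on conventional hardware.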

Statistical approaches to STR mapping had already been developed. Benson, for example, has developed a fast, publicly accessible tandem repeat detection program, available on the Internet at c3.biomath.mssm.edu/trf.html.1 But an exhaustive analysis of chromosome 22 for pentanucleotide STRs pushed the ABCC's computers to their limits; according to Burt, the effort was abandoned when the analysis became too lengthy.

When the same task was run on the SV1, however, it was completed in two seconds. With such a rapid response, the investigators then turned to the entire 3.2 billion-nucleotide human genome sequence. The computer returned a result in 130 seconds. This problem, says Burt, was solved using a fraction of the computer's computational muscle: the ABCC's SV1 has 96 processors, but only eight were devoted to the problem. Burt notes that two features of the Cray platform contributed to its efficiency with the STR map: the computer's pattern-matching functionality and its optimized vector-processing capabilities, which allow a researcher to perform the same operation on many elements of an array at the same time.
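
The vector idea is simple to demonstrate. In the Python sketch below, NumPy array operations stand in for hardware vector registers--an analogy only, since the article does not detail the Cray instruction set. One whole-array comparison tests every position of a sequence at once for consistency with a period-two repeat, instead of looping residue by residue.

    import numpy as np

    # Encode a toy sequence as bytes; a real run would use chromosome-scale arrays.
    seq = np.frombuffer(b"GGACACACACTTAGAGAGCC", dtype=np.uint8)

    # A single vectorized comparison: position i is consistent with a
    # dinucleotide repeat when seq[i] == seq[i + 2]. No Python-level loop.
    matches = seq[:-2] == seq[2:]
    print(np.flatnonzero(matches))  # candidate positions for period-2 STRs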

The Cray's pattern-matching functionality can detect patterns without prior knowledge of the pattern sequence. STR analyses typically begin with a researcher directing the bioinformatics software to find, for example, all GC repeats in a given sequence. The NCI's software, in contrast, can locate STRs whose repeat units have not been specified in advance. Although this is also true of Benson's algorithm, his employs a statistical approach.
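
What such unit-agnostic detection might look like is sketched below, assuming the two-to-16-nucleotide range quoted earlier. Rather than searching for a supplied motif, it enumerates every possible repeat-unit length; the actual NCI software has not been published, so the details here are illustrative only.

    def find_strs(seq: str, min_unit: int = 2, max_unit: int = 16,
                  min_copies: int = 3):
        """Yield (start, unit, copies) for tandem runs of every repeat-unit
        length, with no motif supplied in advance."""
        for unit in range(min_unit, max_unit + 1):
            start = 0
            while start + unit * min_copies <= len(seq):
                copies = 1
                while (seq[start + copies * unit:start + (copies + 1) * unit]
                       == seq[start:start + unit]):
                    copies += 1
                if copies >= min_copies:
                    yield (start, seq[start:start + unit], copies)
                    start += copies * unit  # skip past this run
                else:
                    start += 1

    # Example: list(find_strs("TTGACGACGACGTT")) -> [(2, 'GAC', 3)]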

More Widely Applicable?

The next challenge for ABCC and Cray researchers is to determine whether the SV1's speed advantage in this particular instance is generally applicable to other bioinformatics problems. These include sequence assembly, EST clustering, protein class recognition, and detecting intron and exon boundaries.

Clearly this approach will not work in every instance; it will function effectively only on those problems with a pattern-matching component. Computational problems such as protein folding would be just as laborious on the SV1 as on any other comparable supercomputer.

The SNL's Heffelfinger says it is not clear how long companies will devote the vast majority of their computing cycles to pattern-matching questions. He notes that companies are beginning to move from issues of genome annotation to questions of functional genomics (such as clustering, in which related genes are identified based on similar behavior in gene expression studies, and signal processing for mass spectrometry). These problems are less dependent on fast pattern-matching algorithms and would therefore gain less of a boost from the Cray hardware. However, Jack Collins, an ABCC programmer working on the Cray-NCI collaboration, counters that the pattern-matching questions addressed on the Cray will continue to be of importance. New sequence information is produced at a staggering rate, and must be analyzed quickly to be of the greatest use to biologists. "Science and technology move rapidly, and we must constantly be updating our hypotheses with the latest information," he says.

Now that the STR map has been completed, work is under way to place the data in a publicly accessible database. NCI biologists have already begun to make new discoveries, which they declined to discuss. Interestingly, they now have a new and unexpected problem: there is too much data. Says Burt, "One of the things that we have encountered with this is to try to make sense of the tremendous amount of data that you get." Basically, the completed map is so data-rich that researchers need to find ways to filter it to find items of interest. As Burt puts it, "how do you wrap yourself around this [data] to get a global picture?"

Jeffrey M. Perkel can be contacted at jperkel@the-scientist.com.
References
1. G. Benson, "Tandem repeats finder: a program to analyze DNA sequences," Nucleic Acids Research, 27[2]:573-80, 1999.