Supercomputing in the Life Sciences


Sep 2, 2002
Amy Adams
Photo: Courtesy of Hewlett-Packard/Compaq

The world's fastest supercomputer--Japan's Earth Simulator--occupies an area equivalent to four tennis courts on three floors. It contains 5,120 processors, 10 terabytes of memory, 700 terabytes of hard disk space, and it can perform 40 trillion floating point operations every second. The computer's performance exceeds that of the world's 18 next-fastest computers. It also cost an estimated $350 million (US).

Japanese scientists use this supercomputer to model global weather patterns, a monumentally complex problem. Though biological challenges may be less daunting, life scientists are finding uses for supercomputers, too. Researchers use them to search for disease-associated genes and potential drug targets, and to model protein folding, among other applications. Many investigators, in fact, have probably used a supercomputer without even being aware of it: Run a BLAST (basic local alignment search tool) search through the National Center for Biotechnology Information, and you are interacting--if only remotely--with a supercomputer.
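For readers who prefer to script such searches rather than use NCBI's Web form, a minimal sketch using the open-source Biopython library (a tool choice assumed here, not something mentioned in the article) might look like the following; the qblast call hands the query to NCBI's servers, which do the heavy computing remotely.

    # A rough sketch, assuming Biopython is installed, of submitting a remote
    # BLAST search to NCBI; the search itself runs on NCBI's machines.
    from Bio.Blast import NCBIWWW, NCBIXML

    query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # toy nucleotide sequence

    # qblast sends the query to NCBI and returns a handle to the XML results.
    result_handle = NCBIWWW.qblast("blastn", "nt", query)

    # Parse the results and print the top few alignments with their E-values.
    record = NCBIXML.read(result_handle)
    for alignment in record.alignments[:5]:
        print(alignment.title, alignment.hsps[0].expect)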

Though the critical computational challenges in life sciences are unique, the same key companies that provide hardware to physicists and cosmologists can be found in the genomics and pharmaceutical arenas: IBM, Sun Microsystems, Hewlett-Packard/Compaq, and Silicon Graphics (SGI), to name a few. These companies are moving toward physically smaller, faster machines to hold and manipulate the growing pool of biological data.

THE NEED FOR SPEED ... AND SPACE

High-end bioinformatics applications push the envelope for both computational speed and storage space. According to Jack Dongarra, professor of computer science at the University of Tennessee, Knoxville, and co-organizer of the Web site Top500.org that ranks supercomputers, researchers use supercomputers to solve some of their most challenging computational problems. These problems come in two basic forms: some are simply big, involving huge datasets and vast numbers of relatively trivial computations; others are incredibly complicated, such as weather modeling.

The servers and systems that store all those numbers can hold a terabyte or more of information. A terabyte is 1,000 gigabytes, about the same amount of information as can be stored on 50,000 trees' worth of paper. It will be a while before each lab bay needs this kind of storage capacity, but it is not an unusual amount of data for a large biotechnology company, says Dan Stevens, marketing manager for life and chemical sciences at Mountain View, Calif.-based SGI. "In the last four years we went from [a] terabyte being huge to [it] being something you install just as part of your work," Stevens says. Celera, the Rockville, Md.-based genomics giant, has 110 terabytes of storage in its computer farm--roughly 11 times the amount of information contained in the print version of the US Library of Congress. Researchers at Santa Clara, Calif.-based Sun Microsystems estimate that data generated by the life sciences double every six months, which will quickly bring requirements up into the petabyte--1,000 terabyte--range. All US research libraries combined add up to about two petabytes of data. And just over the horizon lies exabyte-level (1,000 petabytes) storage. To put that number in perspective, according to one researcher, every word ever spoken by human beings equals about five exabytes (www.cacr.caltech.edu/~roy/dataquan/).
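A back-of-the-envelope calculation shows how quickly that doubling rate bites. The sketch below simply takes Celera's 110 terabytes as a starting point (an illustrative choice, not a projection from Sun or Celera) and asks when it would pass the petabyte mark.

    # Illustrative arithmetic only: if storage needs double every six months,
    # a 110-terabyte farm crosses the petabyte (1,000-terabyte) mark in about
    # two years.
    storage_tb = 110.0
    months = 0
    while storage_tb < 1000:   # 1 petabyte = 1,000 terabytes
        storage_tb *= 2        # one doubling every six months
        months += 6
    print(f"~{months} months to pass a petabyte ({storage_tb:.0f} TB)")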

As computers hold more and more information, the processing of that data becomes more difficult. Ask Joseph Borkowski what the driving factor behind computer innovation in the life sciences is, and he has one thing to say: "speed, speed, speed." Borkowski, director of bioinformatics at Pasadena, Calif.-based Paracel, says most bioinformatics applications require computers to churn through relatively small calculations at extremely fast speeds. The eager researcher running a BLAST search wants results as soon as possible, whereas the physicist feeding data into a supercomputer can wait days or even weeks for results.

To achieve this kind of speed, Paracel's GeneMatcher uses stripped-down, custom-designed chips. "It's as though we've dispensed with the parts of the chip that we don't need," Borkowski says. With these lean-and-mean processors, the GeneMatcher can search genetic sequences 1,200 times faster than an off-the-shelf computer, according to Borkowski. What's more, by reducing the chip's size Paracel can fit more processors in a single box--as many as 115,000 in one case--making the whole package smaller. On average, companies want about 10,000 processors and a few hundred gigabytes, though some customers have requested arrays in the terabyte range. Celera, once a top customer of Paracel's, purchased the company in 2000.

UNIQUE OPTIMIZATIONS

Another way to achieve speed is to optimize existing algorithms for use on standard processors, so that the jobs run faster. Paracel's BlastMachine runs the Linux operating system on an Intel processor. The company has optimized code for open-source homology-search software such as BLAST, GeneWise, and HMM-Frame (Hidden Markov Model).

SGI has adopted a similar approach. The company's flagship High-Throughput Computing (HTC) environment speeds common applications running on SGI machines. With a cluster of SGI Origin 300 and Origin 3000 supercomputers, users can run HTC-optimized applications, such as searches, at much faster speeds than nonoptimized applications.

Although these SGI Origin servers and supercomputers are the same machines used by other industries, the optimizations are unique to bioinformatics and life sciences applications, says SGI's Stevens. "In bioinformatics there are many important open-source applications," he says, which allow SGI to optimize that code for its machines. For other applications, SGI partners with software vendors to generate optimized programs.

SGI also specializes in making data available to users around the world. The company's CXFS platform allows computers running different operating systems to access centrally stored data at near local speeds. "This flexible data-serving environment is very unique and is becoming big business for us," Stevens says. This capability is particularly important in today's age of international mergers, where storing data in a single location for thousands of users provides big cost savings.

Similarly, the company's Reality Centers allow users around the world to see and manipulate three-dimensional models of potential drug candidates. "SGI has the software to keep data one place, compress it real time, send it over high-bandwidth lines, and make the application fully available for remote users on the other end," says Juli Nash Moultray, biology marketing manager at SGI. She says that in order to make a decision about a potential drug, people from different backgrounds must be able to see the protein interactions in three dimensions. "Advanced visualization technology makes it possible to rotate molecules and to test molecule-molecule interactions, ultimately improving the drug-discovery process," she says.

SUPERCOMPUTING STALWARTS

Meanwhile, other companies have focused on the traditional big, fast machines that have advanced the sciences of physics, astronomy, and meteorology. Seattle, Wash.-based Cray, a longtime leader in supercomputing, recently turned its attention to bioinformatics applications by partnering with the National Cancer Institute. In one demonstration, the collaborators produced a map of short tandem repeat markers in a matter of seconds; previously the operation required hours.1

Likewise, anyone who follows chess can attest to the supercomputing muscle of Somers, N.Y.-based IBM's Deep Blue. In naming its latest supercomputing research project "Blue Gene," IBM has acknowledged its major interest in the life sciences. When completed, the machine is expected to operate at speeds as fast as 1,000 teraflops (one quadrillion operations per second), nearly 30 times faster than the Earth Simulator. With hundreds of thousands of processors, all working together seamlessly, Blue Gene will be able to tackle complex digital simulations, such as modeling protein folding.

IBM is developing the first machine in the Blue Gene family, Blue Gene/L, in collaboration with the Lawrence Livermore National Laboratory. The Livermore machine will operate at 200 teraflops (200 trillion operations per second), which is more computing power than the top 500 fastest supercomputers combined.

Blue Gene may be the largest of IBM's supercomputing research projects, but the company has invested in a life sciences business unit that is focused on projects of more relevance to the average biologist. One challenge for the company was learning how biologists' needs differed from those of other IBM clients, says Peter Ungaro, IBM's vice president of high-performance computing. Physicists need a lot of memory to store huge amounts of data for large calculations, but "biologists are concerned about getting data in and out very fast," Ungaro says. "We were used to jobs that would last hours or months. Now we're talking about seconds or minutes."

Moving toward the future, Ungaro says the challenge is taking the human genome project and putting it to use. "How do we take this data from proteomics and functional genomics and bring that into delivering new drugs to the market and advancing personalized medicine?" he asks. Because IBM uses its own products in its research labs, Ungaro says the company knows better than anybody what the computing requirements will be for the future--and is developing the faster, better optimized products to prove it.

MEETING CUSTOMER NEEDS

Another computing mainstay that has entered the life sciences market is Palo Alto, Calif.-based Hewlett-Packard/Compaq. According to Daniel Joy, the group's business development manager, Compaq's ties to academic labs combined with Hewlett-Packard's expertise in research and development put the new company in a good position for growing its life sciences initiative. The company offers products for storage, networking, computer farms, and optimized algorithms, plus supercomputers like the Alpha SC and SuperDome. But the company's real strength, Joy says, comes from applying its know-how to piece these units together into a configuration that will meet a customer's needs. "We've been involved in the field for so long, we can offer a lot of support," he says.

Similarly, Sun Microsystems' basic hardware remains the same across different applications. According to the company's global director of life sciences, Howard Asher, Sun's success lies not in specializing its high-end servers and computers for the life sciences but in working with its software and system-integrator partners to provide complete solutions that include servers, storage, software, and services. Sun's Solaris operating system is known to be robust and reliable, making the company an attractive partner for software vendors. Asher says that Sun's hardware, coupled with a wide array of software products optimized for those machines, allows the company to assemble the best solution for each customer's needs.

For those scientists who do not want to wait for future hardware advances, two companies offer software products designed to speed up heterogeneous computer clusters. New Haven, Conn.-based TurboGenomics' TurboHub algorithm helps these computer clusters parcel out bioinformatics tasks more effectively, bringing their speed closer to that of a bona fide supercomputer. Bench scientists can prioritize or terminate jobs being managed by TurboHub, or change which components are part of the TurboHub pool with the TurboBench interface. The company's TurboBlast divides a BLAST search into smaller chunks and then distributes them, enabling available processors to work on the problem in parallel.
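The general idea behind such divide-and-distribute tools can be sketched in a few lines of Python. The run_search function below is a hypothetical stand-in for a real BLAST invocation, and nothing here reflects TurboBlast's actual implementation; it only illustrates splitting a query set into chunks and searching them in parallel.

    # A minimal sketch of chunking queries and searching in parallel on
    # whatever local processors are available.
    from multiprocessing import Pool

    def run_search(queries):
        """Hypothetical stand-in for running BLAST on one chunk of queries."""
        return [f"hit-for-query-{i}" for i, _ in enumerate(queries)]

    def split(queries, n_chunks):
        """Divide the query list into roughly equal chunks."""
        size = max(1, len(queries) // n_chunks)
        return [queries[i:i + size] for i in range(0, len(queries), size)]

    if __name__ == "__main__":
        queries = ["ATGCGTTAGC"] * 100            # toy query set
        chunks = split(queries, n_chunks=4)
        with Pool(processes=4) as pool:           # one worker per processor
            partial = pool.map(run_search, chunks)
        hits = [h for chunk in partial for h in chunk]   # merge the results
        print(len(hits), "hits")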

Likewise, Crystal Bay, Nev.-based TimeLogic's DeCypher is an accelerator server that speeds up common bioinformatics applications on existing hardware. Using DeCypher, a company can run all its bioinformatics applications on one machine, freeing up other machines for proprietary or more specialized tasks. In the process, it reduces the need to buy additional hardware to support all of the necessary applications.

With a new focus on life science by companies that traditionally brushed off the so-called softer science, biologists can expect to see ever-faster computers, bigger storage devices, and new applications designed especially for their unique requirements. And though most biologists won't need to become part-time information technology professionals, they will continue to have more and more interaction with machines that never made an appearance in their college biology courses.

Amy Adams (amya@nasw.org) is a freelance writer in Mountain View, Calif.

1. J.M. Perkel, "NCI, Cray blaze through genome map," The Scientist, 15[16]:1, Aug. 20, 2001.


BOOTSTRAP SUPERCOMPUTING

In his efforts to understand the mechanisms of protein folding, Vijay Pande, an assistant professor of chemistry and structural biology at Stanford University, looked to the SETI@home project for inspiration. To crunch all of its sky-survey data, the search-for-extraterrestrial-intelligence project recruits home computers through a screen saver that runs the analysis whenever the machines are idle.

Protein folding is a problem of similar magnitude. Proteins fold on a timescale of milliseconds, and to understand how that happens, you need to "look at every atom at every point in time ... [but] a millisecond is a pretty long time to simulate. Atoms whiz around very quickly--on the femtosecond time scale--and there are a trillion femtoseconds in a millisecond," Pande says. The fastest processors can push Pande's algorithm through only about a nanosecond of simulated time per day, so a single machine would need millions of days to complete a calculation.
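The arithmetic behind those numbers is simple to check; the sketch below just restates the article's figures (a millisecond of folding, femtosecond atomic motion, roughly a nanosecond of simulation per processor per day) in code.

    # Worked numbers from Pande's description.
    fs_per_ms = 1e-3 / 1e-15                  # = 1e12, "a trillion femtoseconds"
    ns_to_simulate = 1e-3 / 1e-9              # a millisecond is a million nanoseconds
    days_for_one_cpu = ns_to_simulate / 1.0   # at roughly 1 nanosecond per day
    print(f"{fs_per_ms:.0e} femtoseconds in a millisecond")
    print(f"~{days_for_one_cpu:,.0f} days for a single processor")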

Through his distributed computing network, folding@home (folding.stanford.edu), Pande harnesses the untapped power of idle computers, allowing him to effectively run 40,000 processors in parallel--exceeding the raw computing power of all the world's supercomputing centers combined. Completed calculations trickle back sporadically from the home computers to a central server, which collates the results. This approach allows Pande to model the dynamics of a protein with 40 to 50 amino acids in just one day.
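The basic pattern is easy to picture in miniature: a server hands out independent work units, volunteers compute them whenever they can, and results come back in whatever order they finish. The toy sketch below runs everything on one machine with local processes and is purely illustrative; folding@home's real protocol and data formats are different.

    # A toy, single-machine sketch of the work-unit pattern behind
    # distributed-computing projects (illustrative only).
    from multiprocessing import Process, Queue

    def volunteer(work_q, result_q):
        """Plays the role of a home computer: fetch a unit, compute, report back."""
        while True:
            unit = work_q.get()
            if unit is None:                  # sentinel: no more work
                break
            result_q.put((unit, unit ** 2))   # stand-in for a simulation step

    if __name__ == "__main__":
        work_q, result_q = Queue(), Queue()
        workers = [Process(target=volunteer, args=(work_q, result_q))
                   for _ in range(4)]
        for w in workers:
            w.start()
        for unit in range(20):                # the "server" hands out work units
            work_q.put(unit)
        for _ in workers:                     # one sentinel per worker
            work_q.put(None)
        results = [result_q.get() for _ in range(20)]   # collected as they arrive
        for w in workers:
            w.join()
        print(sorted(results)[:5])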

But distributed computing is no simple exercise, Pande warns. It requires an experienced software engineer to handle data flowing from a large number of computers and arriving at different times. Also, the program must not be intrusive, says Christopher Hogue, a biochemist at the Samuel Lunenfeld Research Institute, part of Mt. Sinai Hospital in Toronto, Ontario. "When you're using other people's computers, you have to be very nice to them, otherwise they won't come back."

Like Pande, Hogue studies protein folding, using both distributed computing (www.distributedfolding.org) and Beowulf clusters (bioinfo.mshri.on.ca/yac/). In a Beowulf cluster, many computers--usually heterogeneous--are linked together to perform calculations in parallel; each computer, or node, becomes the equivalent of a single CPU in a supercomputer. These clusters do not suffer from the slow and complicated communication of a distributed computing network, but they also cannot realistically achieve the same processing punch. Pande says they typically top out at two or three orders of magnitude below what he has achieved.
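For comparison, work on a Beowulf cluster is usually coordinated by explicit message passing among the nodes. The sketch below uses the mpi4py bindings (an assumed tool choice, not one named in the article) and would be launched with mpirun across the cluster's nodes.

    # A minimal message-passing sketch in the Beowulf style: each node computes
    # a piece of the problem and node 0 combines the pieces.
    # Run with, e.g.:  mpirun -n 4 python cluster_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()     # this node's identity
    size = comm.Get_size()     # total number of nodes in the cluster

    # Each node sums its own slice of the range in parallel...
    partial = sum(range(rank, 1_000_000, size))

    # ...and the pieces are reduced to a single total on node 0.
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total:", total)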

Both researchers agree that the two types of bootstrapped supercomputers are complementary rather than competitive, but they also agree that distributed computing has by far the most potential. As of now, which method is better "depends on your algorithm," says Pande. "Some algorithms, people really believe, require fast communication, and parallel molecular dynamics [which Pande uses] used to be one of those ... we did something that people thought was impossible. I think it's interesting to reexamine some of those problems, especially since the computational power [of a cluster] will never approach what you can get by distributed computing."

Hogue has also moved on to distributed computing, gaining 10 to 20 times the computing power. But he reiterates the challenges that distributed computing presents software engineers. "You have to build server infrastructure, set up a framework for people to log in, collect data, do statistics on the data ... that's why there are only a few labs doing this sort of work."

--Jim Kling