Minds Must Unite
Making Biological Computing Smarter
The Future Look of the Sanger Institute
The air at the Wellcome Trust Sanger Institute in Hinxton, near Cambridge, fairly hums with electricity. At the end of a long corridor, on the other side of a set of double doors in what's known as J-block lies the Institute's data center, the brains of a vast bioinformatics operation. Within, a loud voice and a careful tread are useful – one to be heard above the drone of the machines, the other to avoid the streams of cold air that billow up from vents in the floor.
The drone wasn't quite so loud when Sanger opened 12...
Sanger's IT systems have had to keep up with exponential increases in output from its sequencing instruments. The Institute, which had a €300 million budget for 2001–2006, is perhaps best known as the European operations center for the public human genome sequencing initiative, contributing some 1.2 billion bases so far. And that's just the human sequence; according to Sanger statistics, the Institute has produced some 3.9 billion bases in total from nearly 100 different species.
Sanger's sequencing center currently includes 70-plus ABI 3730 capillary units running 96 sequences simultaneously, 20 times a day. Those machines churn out more than 60 million bases of raw data per day, which is processed, assembled, and annotated to yield some 600 million bases of finished sequence per year. Add to that the data submitted from outside to the joint Sanger-EBI Ensembl genome database project, which Sanger also hosts. Along with Vega, the Trace Archive, and the other databases published here, Ensembl attracts eight million page hits a week.
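The instrument counts above imply a daily read throughput that can be checked with a little arithmetic. The sketch below is illustrative only: it assumes exactly 70 machines (the text says "70-plus"), treats 60 million bases as the daily raw total, and assumes the instruments run 365 days a year. The function names are hypothetical.

```python
def reads_per_day(machines: int, lanes_per_run: int, runs_per_day: int) -> int:
    """Number of sequence reads produced per day across all instruments."""
    return machines * lanes_per_run * runs_per_day


def mean_bases_per_read(raw_bases_per_day: float, reads: int) -> float:
    """Average raw bases contributed by each read."""
    return raw_bases_per_day / reads


# Figures quoted in the article; 70 is a lower bound ("70-plus" machines).
reads = reads_per_day(machines=70, lanes_per_run=96, runs_per_day=20)
avg_read = mean_bases_per_read(60_000_000, reads)

# Assumption: machines run every day of the year.
raw_per_year = 60_000_000 * 365
raw_to_finished = raw_per_year / 600_000_000

print(f"{reads:,} reads/day, about {avg_read:.0f} raw bases per read")
print(f"about {raw_to_finished:.1f} raw bases per finished base")
```

Under these assumptions, the gap between raw and finished output reflects the redundancy needed to process, assemble, and annotate the sequence.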
It all adds up to 300 terabytes (TB, approximately 10¹² bytes) of storage, equivalent to about 375,000 CD-ROMs, distributed between three computer rooms. Half of that is unique data; the rest consists of mirrored databases, scratch space, duplicates, and workspace, with additional back-ups stored on tape. Across the pond, the National Institutes of Health's genomics repository, the National Center for Biotechnology Information (NCBI), has a similarly sized operation, with 370 TB of storage.
Courtesy of the Wellcome Trust Sanger Institute
Sanger's IT system architecture is in a constant state of flux, with new equipment coming online to replace older, outmoded hardware. In the past, that meant building new rooms. Today, the Sanger is moving to a new facility and embracing a crop-rotation ideology: four computer rooms, with three in use, and one lying fallow for deployment of new systems.
Butcher expects to need petabyte (PB) capabilities (approximately 10¹⁵ bytes) possibly within three years. That's a lot of data – whether you think of it as quadrillions of bytes or millions of gigabytes, it is the equivalent of 125 billion pages of
It's not enough to simply store the data, however. Data must also be organized. "We were talking about upping our genotyping rate to 7 million SNPs [single nucleotide polymorphisms] a day, [which] means adding seven million rows to umpteen databases," says Tony Cox, Sanger's head of software services. "We have to integrate things in a sensible fashion; otherwise we get crushed under the weight of it."
When Cox's colleague Martin Widlake, head of database services, arrived at Sanger four years ago, he had four databases to manage. "They were, on average, a couple of hundred gigabytes in size. We now have 40," he says, "and two of them are up into the 40 or 50 terabytes." Widlake adds that newer databases also tend to be much larger because increasing volumes of image data are being stored.
At the same time, of course, the hardware itself is getting smaller and faster. In accordance with Moore's Law, integrated circuits have been growing in power exponentially. The processors used by Sanger have increased from 800 MHz three years ago to 3.2 and 3.6 GHz today, representing a fourfold increase for the same footprint. Similarly, storage racks have increased from 10 to 35–40 TB for the same size.
But still, Sanger repeatedly outgrows its IT accommodations. Widlake says the Trace sequence archive currently contains about 700 million entries, and it doubles every 10 months. "We're falling behind in the ratio between computing growth and the actual raw data that we're generating," he says.
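Widlake's point can be made concrete by comparing doubling times. The sketch below is a rough illustration, not a capacity model: it uses the article's figures (700 million Trace entries doubling every 10 months, and processor clock speeds rising fourfold in three years), and it assumes that clock speed is a fair stand-in for computing growth, which is a simplification. The function names are hypothetical.

```python
import math


def doubling_time_months(growth_factor: float, months: float) -> float:
    """Doubling time implied by a given growth factor over a period."""
    return months / math.log2(growth_factor)


def annual_growth_factor(doubling_months: float) -> float:
    """How much a quantity multiplies in 12 months at a given doubling time."""
    return 2 ** (12 / doubling_months)


def projected(current: float, doubling_months: float, months_ahead: float) -> float:
    """Exponential projection of a quantity some months into the future."""
    return current * 2 ** (months_ahead / doubling_months)


# Fourfold clock-speed increase over three years (800 MHz to 3.2 GHz).
compute_doubling = doubling_time_months(growth_factor=4, months=36)

data_per_year = annual_growth_factor(10)                  # Trace archive
compute_per_year = annual_growth_factor(compute_doubling)

# Trace archive size after three more years at the quoted rate.
entries_in_3_years = projected(700e6, doubling_months=10, months_ahead=36)

print(f"compute doubles every {compute_doubling:.0f} months")
print(f"data grows x{data_per_year:.2f}/year, compute x{compute_per_year:.2f}/year")
print(f"Trace entries in three years: {entries_in_3_years:.2e}")
```

On these assumptions the archive more than doubles each year while clock speed grows by well under a factor of two, which is the widening ratio Widlake describes.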
"To date, the solution has been to build another machine room," says Butcher, "but that's not very cost effective." So this time Sanger is taking an agricultural approach to managing the growth of its IT crop into the future.
The four new machine rooms will provide 1,000 m² of floor space, more than double the present capacity. The IT hardware will be confined to three of those rooms. The fourth will lie fallow, providing the space to test and install new systems as old equipment, which Butcher says generally lasts about three years, is retired. He is optimistic that the rooms will contain IT expansion for longer than the five years for which they were designed.
More and faster machines produce more heat. A single rack holding 170 3.2-GHz processors (Sanger has 10 such racks out of a total of about 150 racks) produces 20 kW of heat, equivalent to 20 small space heaters. Increase the clock rates to 3.6 GHz and the heat output jumps to 30 kW.
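The rack figures translate into a per-processor heat load, which shows how steeply heat rises with clock rate. This is a back-of-the-envelope sketch that assumes all of a rack's heat comes from its 170 processors, which overstates the per-chip figure since other components also dissipate power. The function name is hypothetical.

```python
def watts_per_processor(rack_kw: float, processors: int) -> float:
    """Rack heat output divided evenly across its processors, in watts."""
    return rack_kw * 1000 / processors


heat_3_2ghz = watts_per_processor(20, 170)
heat_3_6ghz = watts_per_processor(30, 170)

clock_increase = 3.6 / 3.2 - 1   # fractional rise in clock rate
heat_increase = 30 / 20 - 1      # fractional rise in rack heat output

print(f"{heat_3_2ghz:.0f} W vs {heat_3_6ghz:.0f} W per processor")
print(f"clock up {clock_increase:.1%}, heat up {heat_increase:.0%}")
```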
So the new rooms will also be scalable in terms of cooling power. Rather than fighting the laws of physics by blowing cool air up from the ground, the new system will work with nature, dropping cool air from above to circulate around the racks and be drawn up on the other side. The rooms will be cooled at a rate of 2 kW/m², which can be turned up to 4 kW/m². Converting to liquid coolants is also an option, should things really start heating up.
Courtesy of the Wellcome Trust Sanger Institute
Four years after the declared end of the Human Genome Project, the Sanger still churns out more than 60 million bases of raw data per day. Shown here are a fraction of the more than 150 computer racks in Sanger's IT department that, among other things, process that data and make it available to researchers around the world.
Power bills will also become huge. While the current equipment requires about 0.75 megawatts (MW) at a cost of €140,000 per year, the new setup will need perhaps 1.4 MW, pushing the bill up to €500,000. (According to Jennifer Medley, spokesperson for electricity provider Exelon, 1 MW powers 1,000 homes.)
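The quoted bills imply a price per kilowatt-hour, which can be worked out as a sanity check. The sketch below assumes the equipment draws its stated load continuously, 24 hours a day for 365 days, and that the bill covers energy alone; neither assumption comes from the article, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: continuous operation, non-leap year


def implied_price_per_kwh(annual_cost_eur: float, load_mw: float) -> float:
    """Electricity price implied by an annual bill and a constant load."""
    annual_kwh = load_mw * 1000 * HOURS_PER_YEAR
    return annual_cost_eur / annual_kwh


current_price = implied_price_per_kwh(140_000, 0.75)
future_price = implied_price_per_kwh(500_000, 1.4)

print(f"current: about {current_price * 100:.1f} euro cents per kWh")
print(f"new setup: about {future_price * 100:.1f} euro cents per kWh")
```

Under these assumptions the projected bill rises faster than the load does, so the estimate appears to build in a higher unit price as well as greater consumption.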
Meanwhile, the cost of hardware is coming down. Between 2001 and 2006, the Wellcome Trust set aside €64 million for the Sanger's IT demands. Butcher says the budget for the next session is being discussed, but he expects it to be considerably less.
A shift from proprietary to commodity hardware will also help keep costs down, as will the planned move away from a proprietary 64-bit operating system to open-source Linux. Though the move chimes with Sanger's open-source ethos for its sequence data, Butcher cites solid practical reasons for the change. "HP [Hewlett Packard] pulled the plug on the Alpha chip," he says, "so we have nowhere to go." Moving to another proprietary system means it could happen again, he says. "I want something we can rely on and have control over our destiny for a good few years into the future."
THE HUMAN TOUCH
With computer systems outgrowing their habitats as sequence data pours from the labs, compromises must be made at other stages of the genome-processing pipeline, for instance, in sequence annotation.
"Biologists want manually curated, biologically validated annotation," says David DeCaprio of the Broad Institute's annotation informatics group in Cambridge, Mass. "Sequencing technology is getting faster, but we don't have high-throughput biological assays for all kinds of functional behavior." Moreover, skilled human annotators are expensive. "It's basic economics," says DeCaprio. "It's simply not tractable to try to keep up with the sequencing, because the costs are just exorbitant."
Tim Hubbard, Sanger's head of human genome analysis, stresses that automated, statistical annotation continues to improve. "We always had this belief that we could automate this problem away, but you can only go so far," he says. "Annotators do a better job, basically."
This means that new genome projects, such as cow, elephant, and marmoset, are unlikely to be annotated to the same standard as human, mouse, yeast, and worm. DeCaprio sees it as an inevitable trade-off rather than a problem. "We used to have no data; now we have lots of data," he says. "We can find ways of making that data useful without having to go and do tons of manual curation on it."
Hubbard says many of the genomes currently in the pipeline are not intended to be finished to the degree that they require manual annotation. "We're going to learn things about conservation across vertebrates as they've diverged," he says. "But we don't have to annotate it. It's not a priority."
And maybe sequence data will not always be growing as fast as it is now. "Although the 16-month doubling rate appears to be continuing, it's becoming a challenge to the capacity of existing sequencing technologies to keep up an exponential pace," says Dennis Benson, chief of NCBI's information resources branch.
Or, perhaps improvements in sequencing technology will mean that the data grow even faster. At CERN, Hammerle says that when the Large Hadron Collider goes online in 2007, it will generate 15 PB of data annually. A new generation of resequencing machines, such as that developed by 454 Life Sciences, the first of which was installed this year at the Broad Institute at the Massachusetts Institute of Technology, might give the sequencing growth curve a similar kick. Within months, says Cox, "we could be generating the sequence of a whole chromosome in a day."