Biosilico 2003: The Data Tsunami

We seem to be heading for vocabulary exhaustion. In a time when “extreme” is a common marketing adjective, we’re running out of superlatives. So the term “tsunami”—the biggest of all waves—was used at BIOSILICO to describe the rate of data generation in the life sciences. And just as a tsunami slamming the coastline is a big problem, the wealth of data pouring into genomic and proteomic databases is overwhelming the ability of scientists to make sense of it and of engineers to build computer systems that can handle it.


Silicon Graphics CEO Robert Bishop reviewed the famed Moore’s Law of computer processors, which essentially says you get twice the performance in half the space (on a chip) at half the price every couple of years. That compounding has given the processor industry a 1,000,000-fold increase in price/performance over 20 years.
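A quick back-of-the-envelope check of that figure (my arithmetic, not Bishop’s): doubling performance while halving price every two years means price/performance quadruples each cycle, and ten such cycles over 20 years compounds to roughly a million-fold.

```python
# Back-of-the-envelope check of the Moore's Law figure quoted above:
# performance doubles and price halves every ~2 years, so price/performance
# improves 4x per cycle; 20 years is ten such cycles.
cycles = 20 // 2                       # ten two-year cycles in 20 years
gain_per_cycle = 2 / 0.5               # 2x performance at 0.5x price -> 4x
total_gain = gain_per_cycle ** cycles  # 4**10
print(f"{total_gain:,.0f}x")           # 1,048,576x -- roughly the million-fold figure
```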

But the message was that even Moore’s Law is not increasing CPU speed fast enough to keep up with the growth in data volume that will be coming out of life science research in the next decade or two. The great achievement of the Human Genome Project I mentioned in the previous post is snowballing into something that nobody quite knows how to manage. Leroy Hood, MD, PhD, founder of the Institute for Systems Biology, said that since 1985 there has been a 4,000-fold increase in gene-sequencing productivity, and there will be another 4,000-fold increase in the next 10 years. Consequently, the life sciences are generating a terabyte (a trillion bytes) of information per day today; they will generate a petabyte (1,000 terabytes) of data per day in the near future and eventually exabytes (1,000,000 terabytes) per day.
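To make those units concrete, here is a rough projection of daily data volume, assuming (my simplification, not Hood’s) that data output grows at the same 4,000-fold-per-decade pace he cites for sequencing productivity:

```python
# Rough projection of daily data volume, assuming it tracks the
# 4,000-fold-per-decade sequencing-productivity figure quoted above
# (an illustrative assumption, not a forecast).
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

today_tb_per_day = 1          # roughly a terabyte of data per day today
growth_per_decade = 4_000     # Hood's 4,000-fold-per-decade figure

ten_years = today_tb_per_day * growth_per_decade         # ~4,000 TB/day
twenty_years = today_tb_per_day * growth_per_decade**2   # ~16,000,000 TB/day

print(f"in 10 years: ~{ten_years / TB_PER_PB:g} PB/day")     # ~4 PB/day
print(f"in 20 years: ~{twenty_years / TB_PER_EB:g} EB/day")  # ~16 EB/day
```

Those orders of magnitude line up with the petabytes-soon, exabytes-eventually trajectory described above.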

Nobody gave figures for the increase in data yield from microarrays (often called gene chips), but one company, Affymetrix, put the first array with detection elements for all 35,000 human genes on a single chip on the market in mid-October. Two other companies will have comparable chips on the market by the end of the year. As the article reporting this put it:

The dense new arrays bring science closer to the era of individualized medicine, when doctors may be able to choose treatments based on the underlying genetic differences between people who, outwardly, appear to have the same disease. Potentially, doctors may one day pop a blood sample into a bench-top machine and scan a patient’s entire genome for problematic DNA.

So there’s the vision—individualized molecular medicine, i.e., medicine that pegs you right down to your unique molecular characteristics. You can put cancer in the forefront of diseases for which such detailed information is expected to be highly relevant. In the meantime, there’s a bunch of big hurdles:

  • Computers with ever-greater performance are needed to receive, store, and crunch the data into something meaningful in a respectable span of time. A VP from Apple Computer was there to tout the performance of the new Mac G5 as a scientific platform. He foreshadowed this week’s announcement that the #3-ranked supercomputer in the world, at Virginia Tech, is a cluster of 1,100 Mac G5s. The numbers are eye-popping: 9.55 teraflops (trillions of operations per second) of processing power at the K-Mart price of $5,300,000 (see the back-of-the-envelope arithmetic after this list). The implications are not insignificant: having access to that much power at such a low price extends the reach of in silico life science projects.
  • Programs and mathematical approaches are needed to turn measurements from the devices into relationships that are really indicative of biological processes. Biologists are more than sated with data; what they need is analysis and interpretation. One speaker said flatly that, until biologists themselves become as competent in mathematics as physicists are, most of the data will be useless, the analyses will be potentially wrong, and the analytical algorithms concocted by hired-gun mathematicians who are biology-challenged will be less than the best.
  • Deeper understanding of the biology behind the data blips is essential. Scientists have yet to understand the versatility and robustness of living systems. It isn’t just a matter of detailing everything out; living systems have alternative ways of working built in, so we have to understand all the ways those systems can act and react. We need to know not just the details but the underlying principles.
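For perspective on that Virginia Tech cluster, here is a quick division on the figures quoted above (my arithmetic, not Apple’s):

```python
# What the "K-Mart price" works out to for the Virginia Tech G5 cluster,
# using only the figures quoted in the first bullet above.
teraflops = 9.55             # quoted processing power
cost_dollars = 5_300_000     # quoted price tag
nodes = 1_100                # the 1,100 Mac G5s in the cluster

dollars_per_gigaflop = cost_dollars / (teraflops * 1_000)
gigaflops_per_node = teraflops * 1_000 / nodes

print(f"~${dollars_per_gigaflop:,.0f} per gigaflop")      # ~$555 per gigaflop
print(f"~{gigaflops_per_node:.1f} gigaflops per node")    # ~8.7 gigaflops per node
```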

While pictures of big waves—tsunamis I guess—were interspersed among the PowerPoints, the message was not that any of this is insurmountable. It’s just that there are not going to be any miraculous breakthroughs under current conditions. There’s just lots of work to do.

Will the virtuous (or is it vicious?) cycle of more-machinery-more-data continue? Well, consider this: Andrew Berlin, PhD, director of a new department at Intel, Precision Biology—that’s right, biology at Intel—is developing devices that enable the isolation and examination of single molecules. Using microfluidics, they can already suspend a single protein molecule in a teensy-weensy chamber and use various probes to see how it’s made. They know how to do this, Berlin says, because computer chip making is largely a process of extremely fine control of molecules; Intel has some of the best chemistry knowledge in the world.

The first installation of this kind of machine will be at the Fred Hutchinson Cancer Center. The first project is to see if they can use extremely precise processes to detect cancer biomarkers in serum. That’s really pushing the envelope of how small a concentration of a cancer-associated protein they can pick out. It’ll be a good trick because, he says, there are 100,000 different proteins in serum. The dilemma: more data to crunch. Oh well, just more sales of what, Pentium X chips?
