Database Lessons, Petabyte Style
Many of the largest databases lurk in the back hallways of science and government research--all that data from particle physics experiments, astronomical observations, and weather simulations. Managers of these databases avoid some of the challenges faced by their business counterparts because, among other things, data sources tend to be relatively static. But many of the technical challenges are the same and even magnified, given the size of scientific database behemoths.
In 2004, the database at the Stanford Linear Accelerator Center at Stanford University collected 500 Gbytes of data a day in the BaBar series of experiments, in which physicists from around the world bombarded matter with high-energy subatomic particles. That meant that every 29 days the center was accumulating a volume of data equal to all the books in the Library of Congress.
The database grew to a petabyte until problems led scientists to redesign the system. "Wondering why our database isn't growing any more?" asked a report on the BaBar experiment when it paused to regroup early last year. In a paper, the experiment's organizers outlined problems, including periodic failure of commodity hardware, issues with a half-million lines of custom C++ code added to the center's Objectivity Inc. database, and problems organizing literally millions of data-set collections from experiments.
The Max Planck Institute for Meteorology in Hamburg, Germany, operates a 250-terabyte Oracle database on Linux servers, filled with data from meteorological simulations from researchers worldwide. Any researcher can make use of the data, and it's often downloaded for investigations into global warming and long-range climate forecasting.
At the rate the institute is receiving data, the database will reach a petabyte sometime in 2007, says database administrator Hannes Thiemann. Because of infrequent downloads, he can run the system on five four-CPU Itanium-based NEC Corp. servers and 25 terabytes of NEC RAID-based storage. Much of the data is stored on tape in automated silos.
When a researcher's meteorological simulation is run, generating huge amounts of information, the data is uploaded to the institute's data warehouse and isn't expected to change again. Consequently, the institute's system is rarely bothered with data updates. The data is stored as BLOBs, or binary large objects, that use the relational tables in an Oracle database as a kind of vast bucket, ignoring the retrieval virtues of plucking data out of particular rows and columns. Most of the institute's database interactions are read-only downloads, which greatly simplifies Thiemann's management problems.
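The write-once, read-many BLOB pattern described above can be sketched in a few lines. This is only an illustration, not the institute's actual schema or code: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the table name, column names, and the upload and download helpers are all hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the Oracle warehouse.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per simulation run, with the output kept as an
# opaque BLOB rather than broken out into queryable rows and columns.
conn.execute(
    "CREATE TABLE simulation_output ("
    " run_id TEXT PRIMARY KEY,"
    " payload BLOB NOT NULL)"
)


def upload(run_id: str, payload: bytes) -> None:
    """Insert a finished run once; nothing in this sketch ever updates it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO simulation_output (run_id, payload) VALUES (?, ?)",
            (run_id, payload),
        )


def download(run_id: str) -> bytes:
    """Read-only retrieval of the whole object by its key."""
    row = conn.execute(
        "SELECT payload FROM simulation_output WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    if row is None:
        raise KeyError(run_id)
    return row[0]


# Usage: store a dummy 1,024-byte payload and read it back unchanged.
upload("run-0001", bytes(range(256)) * 4)
data = download("run-0001")
```

Because the database never looks inside the payload, a lookup is a single fetch by key. That is the trade-off the paragraph describes: simple storage and management, at the cost of being unable to query individual values within a run.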
"Managing the storage is the most complicated thing we do," Thiemann says. He gets help from the Oracle database's ability to view a tape silo as just another set of disks, without special interfacing. "To the database, it all looks like disks."