Database Lessons, Petabyte Style - InformationWeek

01:25 PM

Database Lessons, Petabyte Style

Many of the largest databases lurk in the back hallways of science and government research--all that data from particle physics experiments, astronomical observations, and weather simulations. Managers of these databases avoid some of the challenges faced by their business counterparts because, among other things, data sources tend to be relatively static. But many of the technical challenges are the same and even magnified, given the size of scientific database behemoths.

In 2004, the database at the Stanford Linear Accelerator Center at Stanford University collected 500 Gbytes of data a day in the BaBar series of experiments, in which physicists from around the world bombarded matter with high-energy subatomic particles. That meant that every 29 days the center was accumulating a volume of data equal to all the books in the Library of Congress.

The database grew to a petabyte until problems led scientists to redesign the system. "Wondering why our database isn't growing any more?" asked a report on the BaBar experiment when it paused to regroup early last year. In a paper, the experiment's organizers outlined problems, including periodic failure of commodity hardware, issues with a half-million lines of custom C++ code added to the center's Objectivity Inc. database, and problems organizing literally millions of data-set collections from experiments.

The Max Planck Institute for Meteorology in Hamburg, Germany, operates a 250-terabyte Oracle database on Linux servers, filled with data from meteorological simulations from researchers worldwide. Any researcher can make use of the data, and it's often downloaded for investigations into global warming and long-range climate forecasting.

At the rate the institute is receiving data, the database will reach a petabyte sometime in 2007, says database administrator Hannes Thiemann. Because updates are infrequent, he can run the system on five four-CPU Itanium-based NEC Corp. servers and 25 terabytes of NEC RAID-based storage. Much of the data is stored on tape in automated silos.

When a researcher's meteorological simulation is run, generating huge amounts of information, the data is uploaded to the institute's data warehouse and isn't expected to change again. Consequently, the institute's system is rarely bothered with data updates. The data is stored as BLOBs, or binary large objects, that use the relational tables in an Oracle database as a kind of vast bucket, ignoring the retrieval virtues of plucking data out of particular rows and columns. Most of the institute's database interactions are read-only downloads, which greatly simplifies Thiemann's management problems.
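
The write-once, read-mostly BLOB pattern described above can be sketched in a few lines. This is only an illustration, not the institute's system: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the table name, column names, and the upload_run and download_run helpers are all hypothetical.

```python
import sqlite3

# Stand-in for the data warehouse; the institute uses Oracle, not SQLite.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per uploaded simulation run.
# The payload is an opaque BLOB; the database never looks inside it.
conn.execute(
    "CREATE TABLE simulation_output ("
    " run_id TEXT PRIMARY KEY,"
    " uploaded_by TEXT NOT NULL,"
    " payload BLOB NOT NULL)"
)


def upload_run(run_id: str, researcher: str, data: bytes) -> None:
    """Write-once upload: the row is inserted and not updated afterward."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO simulation_output (run_id, uploaded_by, payload)"
            " VALUES (?, ?, ?)",
            (run_id, researcher, data),
        )


def download_run(run_id: str) -> bytes:
    """Read-only download: fetch the whole object by its key."""
    row = conn.execute(
        "SELECT payload FROM simulation_output WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    if row is None:
        raise KeyError(run_id)
    return row[0]


# Placeholder bytes standing in for real simulation output.
upload_run("run-001", "example-researcher", bytes(range(256)) * 4)
blob = download_run("run-001")
print(len(blob))
```

Because each object goes in and comes out whole, the table acts mostly as an index over large files. That matches the "vast bucket" trade-off: retrieval is simple, but nothing inside the payload can be queried by row or column.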

"Managing the storage is the most complicated thing we do," Thiemann says. He gets help from the Oracle database's ability to view a tape silo as just another set of disks, without special interfacing. "To the database, it all looks like disks."
