Data, Data, Everywhere

Multiterabyte databases are getting downright common. But with more real-time data, complex queries, and increasing numbers of sources, managing them is anything but routine.

Charles Babcock, Editor at Large, Cloud

January 6, 2006


EBay's database of customer and product information fields 750,000 queries a day, with traffic peaking at 1 million on some days, Pride says. The company's developers build business services around the massive data system, such as the Marketplaces Research tool, which lets sellers research customer activity around a particular item, including how customers behaved on the site, how they searched for items, and what captured their attention.

A major recent refinement adds the option of searching a particular time period instead of viewing only the most recent site activity. Demand for that feature came not from outside sellers but from eBay product managers and analysts, who wanted to use it to see how to build demand for specific items and manage auctions. But the tool increases query traffic and complexity. "We accommodate rather than eliminate such demands," Pride says, and eBay now sees "a lot of business value coming out of its use."
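At its core, a feature like that amounts to filtering activity by a caller-supplied date range rather than hard-coding "most recent." Here is a minimal sketch in Python, using hypothetical seller-activity records and field names (the story doesn't describe eBay's actual schema):

```python
from datetime import date

# Hypothetical seller-activity records; item names and fields are illustrative only.
activity = [
    {"item": "vintage camera", "searches": 40, "day": date(2005, 11, 2)},
    {"item": "vintage camera", "searches": 55, "day": date(2005, 12, 14)},
    {"item": "vintage camera", "searches": 61, "day": date(2006, 1, 3)},
]

def searches_in_period(rows, start, end):
    """Sum search counts over a chosen period, not just the latest activity."""
    return sum(r["searches"] for r in rows if start <= r["day"] <= end)

# December 2005 only: returns 55.
print(searches_in_period(activity, date(2005, 12, 1), date(2005, 12, 31)))
```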

Valuable Surprises
Accommodating such demands can be difficult if the database tables haven't been organized and indexed for those queries. But frequently, it's the unexpected, complex query that produces surprising results. By analyzing which items appeared together on checkout lists, Wal-Mart found a relationship between facial tissue and orange juice sales and positioned those items closer together in stores, Phillips says.
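What Phillips describes is a form of market-basket analysis: count how often items land in the same basket and flag pairs that co-occur unusually often. A minimal sketch, using made-up basket data rather than anything from Wal-Mart's warehouse:

```python
from itertools import combinations
from collections import Counter

# Hypothetical checkout lists; item names are illustrative, not Wal-Mart data.
baskets = [
    {"facial tissue", "orange juice", "bread"},
    {"orange juice", "facial tissue", "soup"},
    {"bread", "milk"},
    {"facial tissue", "orange juice"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that co-occur far more often than chance suggest items to shelve together.
for pair, count in pair_counts.most_common(3):
    support = count / len(baskets)  # fraction of baskets containing both items
    print(f"{pair}: {count} baskets (support {support:.2f})")
```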

Database administrators in the 100-terabyte club say the keys to keeping a 100-terabyte database from getting out of control are solid practices for managing stored data, properly indexed tables for query processing, and good extraction, transformation, and loading (ETL) techniques that ensure users can correctly interpret the data. The basic rules of database management are the same for big and small databases, but the economics aren't: the computer hardware, software, and storage needed to manage 100 terabytes of data can run into the millions of dollars.
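Those extraction, transformation, and loading steps are where misinterpretation creeps in, so each transformation should be explicit and documented. A minimal ETL sketch, assuming hypothetical source records, field names, and units:

```python
import csv
from io import StringIO

# Hypothetical source extract; field names and units are assumptions for illustration.
raw = """store_id,sale_date,amount_cents
001,2006-01-03,1299
002,2006-01-03,450
"""

def extract(text):
    """Pull rows out of a CSV extract."""
    return list(csv.DictReader(StringIO(text)))

def transform(rows):
    """Normalize types and units so analysts interpret the data consistently."""
    return [
        {
            "store_id": int(r["store_id"]),
            "sale_date": r["sale_date"],                  # keep ISO-8601 dates as-is
            "amount_usd": int(r["amount_cents"]) / 100,   # cents -> dollars, documented
        }
        for r in rows
    ]

def load(rows, warehouse):
    """Append cleaned rows to the (here, in-memory) warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)
```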

Companies are "spending big sums to get to the next level of detail. If there wasn't business value, then this trend would stop," says Richard Winter, president of Winter Corp., a consulting firm specializing in assembling and managing large database systems. The firm periodically surveys business, academia, government, and other groups to identify the largest databases. The survey is voluntary (eBay and Wal-Mart aren't on the list), but it provides some measure of how big "big" is and an indication of database growth rates and trends.

In Winter Corp.'s most recent survey, conducted in mid-2005, the Yahoo Search Marketing database came out on top as the largest commercial database, with 100.4 terabytes of data running on an Oracle database and a Unix-based Fujitsu-Siemens server. Second place went to AT&T Labs Research, which was running a 93.9-terabyte data warehouse using its proprietary Daytona database software on a Unix-based Hewlett-Packard server. That system has since exceeded 100 terabytes, says David Browne, AT&T's executive director of enterprise data warehousing.

Coping With The Data Deluge
Winter Corp.'s researchers conclude that the largest databases triple in size every two years. To understand why, one need only look at Nielsen Media Research, the TV ratings company. Nielsen no longer just monitors what a family watches while gathered around the TV in the living room; for the data to be meaningful, it has to collect data on multiple TVs per household and multiple family members.

Nielsen database administrator Tim Geary manages not one massive data stream into his data warehouse but many, collecting data from meters inside 12,000 households. Families often have satellite services or set-top boxes such as TiVo that let them record a program and view it later. "Some people are watching the 6 o'clock news at 8 o'clock. A lot of other people record a program but never watch it. We have to watch the playback data," Geary says.
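Handling playback data means comparing when a program aired against when the meter saw it viewed. The sketch below separates live viewing from time-shifted playback; the field names and the seven-day cutoff are assumptions for illustration, not Nielsen's actual rules:

```python
from datetime import datetime, timedelta

# Hypothetical meter events; Nielsen's real schema isn't described in the story.
events = [
    {"program": "6 o'clock news", "aired": "2006-01-02 18:00", "viewed": "2006-01-02 18:00"},
    {"program": "6 o'clock news", "aired": "2006-01-02 18:00", "viewed": "2006-01-02 20:00"},
    {"program": "drama",          "aired": "2006-01-02 21:00", "viewed": "2006-01-09 22:00"},
]

CUTOFF = timedelta(days=7)  # assumed playback window, for illustration only

def classify(event):
    """Label a viewing event as live, playback, or outside the credited window."""
    aired = datetime.strptime(event["aired"], "%Y-%m-%d %H:%M")
    viewed = datetime.strptime(event["viewed"], "%Y-%m-%d %H:%M")
    if viewed == aired:
        return "live"
    return "playback" if viewed - aired <= CUTOFF else "outside window"

for e in events:
    print(e["program"], "->", classify(e))
```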

In the future, Geary says, Nielsen plans to collect TV-watching data from viewers who are outside the home in gyms or bars or carrying video segments with them on iPods.

That complicates his job running a data warehouse that grows 20 Gbytes daily, the equivalent of 40,000 books. The Nielsen data warehouse, running on Sybase IQ software on a Sun Microsystems server, has doubled in size every year for the last three years and now totals 20 terabytes of compressed data. Uncompressed, the data warehouse would be 80 to 100 terabytes, Geary says, making it eligible for membership in the 100-terabyte club.

Are there any limits? Wal-Mart's Phillips isn't betting that the retailer's rate of data accumulation will level off anytime soon. Last year he added another stream of data, from the RFID tags Wal-Mart now requires its top suppliers to use on all shipments. He anticipates that the next generation of tags will periodically measure the temperature of chilled or frozen goods on their way to market. Those readings will go into the data warehouse as proof that produce and frozen foods were kept at an acceptable temperature, or should be disposed of because they weren't. He'll have no problem adding such data to the warehouse, he says; its business value will justify the expense.

"We update over a billion records per day," he says. "The data grows because the company grows."

This story was updated Jan. 30.

About the Author

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld, and former technology editor of Interactive Week. A graduate of Syracuse University with a bachelor's degree in journalism, he joined InformationWeek in 2003.
