As companies and research organizations collect more and more data, the size of databases and data warehouses is growing exponentially. Petabyte databases are the next big thing for database managers. This story examines how IT managers are preparing to handle them.

InformationWeek Staff, Contributor

February 8, 2002

A petabyte of data is difficult to fathom. Think of it as the equivalent of 250 billion pages of text, enough to fill 20 million four-drawer filing cabinets. Or imagine a 2,000-mile-high tower of 1 billion diskettes. Whatever you do, don't stop there--because it's the amount of data many businesses will be managing within the next five years.

The amount of data the average business collects and stores is doubling each year. If that holds true at a company such as Sears Roebuck & Co., which is combining its customer and inventory data warehouses to create a 70-terabyte system, the retailer will hit the 1 petabyte threshold--1,000 terabytes--within four years. "If you told me in 1994 that we'd be looking at 70 terabytes, I would have said you were nuts," says Jonathan Rand, merchandise planning and reporting director for the Hoffman Estates, Ill., retailer, which has $41 billion in annual sales.
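
As a rough back-of-the-envelope check of that projection, assuming the warehouse really does keep doubling every year:

```python
# Illustrative check: a 70-terabyte warehouse doubling annually
# crosses the 1-petabyte (1,000-terabyte) mark in its fourth year.
size_tb, years = 70, 0
while size_tb < 1_000:
    size_tb *= 2
    years += 1
print(years, size_tb)  # 4 years, 1,120 terabytes
```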

Such is life now that businesses collect data from multiple sources, including customer-relationship management and enterprise resource planning applications, online E-commerce systems, and suppliers and business partners. The steadily falling price of storage also fuels the data deluge, with the cost of storing 1 Mbyte of data now about 1% of what it was 10 years ago.

All this information is meant to help companies get to know their customers better and run their businesses more smoothly. But the growing volume of data means more headaches for IT managers. As they prepare for petabyte power, they must ask business managers some hard questions about what data to collect and how it will be used. They need to assess vendor readiness to deliver huge systems and begin to automate management and administrative tasks--all while balancing the value of more data against the changing cost of keeping it.

So far, no one's cracked the petabyte milestone. But somewhere between five and 10 databases, mostly in government and university laboratories, store several hundred terabytes and are quickly approaching 1 petabyte. One of the biggest resides at the Stanford Linear Accelerator Center in Menlo Park, Calif., a national laboratory at Stanford University where particle-physics researchers study the reams of data generated by the laboratory's particle accelerator in an effort to understand the relationship between matter and antimatter. Researchers add 2 terabytes of data every day to an already-brimming 500-terabyte database, leaving the IT staff to wrestle with the challenging task of managing all that information. "We try to compress it as much as we can to save disk space and cut costs," says Jacek Becla, database group manager. The laboratory keeps 40 terabytes of data on disk for quick access and the rest on tape, yet still connected to the database. It expects to hit the petabyte mark early next year.
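
The disk-and-tape split is a classic hierarchical-storage pattern. Here is a minimal sketch of the idea, with hypothetical names and data rather than SLAC's actual software:

```python
# Hypothetical disk/tape tiering sketch: hot data is served from disk,
# cold data stays on tape but remains reachable through the same lookup,
# and is staged onto disk the first time it is requested.
disk_cache = {"run-2002-01": b"recent events"}     # the ~40 TB disk tier
tape_archive = {"run-1999-07": b"older events"}    # the rest, on tape

def read_events(run_id):
    if run_id in disk_cache:          # fast path: already on disk
        return disk_cache[run_id]
    data = tape_archive[run_id]       # slow path: recall from tape
    disk_cache[run_id] = data         # keep a disk copy for later reads
    return data

print(read_events("run-1999-07"))     # staged from tape on first access
```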

The largest database on the drawing boards is at CERN, the European organization for nuclear and particle physics research in Geneva, Switzerland. CERN is constructing a particle accelerator that will begin operating in 2006, and IT managers at the laboratory are designing a system to collect up to 20 petabytes of data from the accelerator every year, potentially leading to the accumulation of hundreds of petabytes. A prototype database CERN is assembling now should reach 1 petabyte by 2004 (see sidebar story, "CERN Project Will Collect Hundreds Of Petabytes Of Data").

Such huge volumes of data will become crucial to the success of sophisticated scientific research. But for businesses, more data isn't always better. Business-technology managers must determine how information will be used to improve a company before deciding how much data they need--and can afford--to collect. "If you're not careful, you can end up with a lot of data you don't need," says Ric Elert, engineering VP at Comscore Networks Inc., a Reston, Va., company that collects and analyzes clickstream data from its clients' Web sites. "Really talk to your business folks and understand what will drive the business," he says.

Leveraging Information

For some, external factors drive the need to collect more data. ACNielsen in Stamford, Conn., collects data on how many households watch TV shows and publishes the ratings. It stores 10 years' worth of demographic information so TV stations and advertisers can see which shows reach which audiences and how that changes over time. But digital broadcast television promises to change the scope of ACNielsen's work. Digital TV will provide five times the amount of programming available today, and that means five times the amount of data that ACNielsen must add to its database. For now, the company is expanding the capacity of its data warehouse from 1 terabyte to 10 terabytes.

"We've gone up by a factor of 10 every three years," CIO Kim Ross says of his company's data growth. That could accelerate as digital TV expands into more homes. "Our product is information," he says. "Our business seems to demand, and will find justification for, any database that we can build."

Other companies are increasing the size of their data repositories because of acquisitions or entry into new business lines. SBC Communications Inc. in San Antonio runs a 20-terabyte NCR Teradata database (actual data, rather than disk space), up from 10.5 terabytes last year. SBC's data warehouse has nearly doubled in size every year since it was built in 1994, says David Browne, the company's enterprise data warehouse executive director. In the '90s, the database grew because SBC acquired Ameritech, Pacific Telesis, and other companies. SBC combined operational, financial, billing, service-order, product, and network-support data into one system, used by 12,000 employees who make as many as 100,000 queries a day. "Growth of the data warehouse isn't our goal," he says. "Business value is our goal."

As SBC expands into new services, its data stores will grow as well. The company plans to expand into wireless, broadband, and long-distance services, which will mean more data to collect and analyze. But Browne wants to make sure company executives leverage the data warehouse to its fullest. "I'm always going back to users to understand what data they're using, how often, and for what applications," he says.

Sears is expanding its database to make it more useful. The 35-terabyte data warehouse of product-inventory and store-sales information is key to the retailer's restructuring as it tries to rebound from a year of falling sales and profits. Executives recently used the data to determine that Sears wasn't making money on cosmetics and bicycles, two product lines it's abandoning. Sears wants to increase earnings more than 13% this year.


Jonathan Rand (right), with Dave Schoening. Photo courtesy Black/Toby.


Sears sees opportunities arising from the merger of its product and sales data with customer data, says merchandise planning and reporting director Rand (right, with technology director Dave Schoening).

Sears is merging its product and sales data warehouse with a 20-terabyte database that contains customer-transaction and financial data. That way, it can get a better understanding of customer shopping habits through market-basket analysis. "The opportunities of putting these two together, from a business point of view, are really exciting," merchandise planning and reporting director Rand says. But, he advises, "bring both IT and business representatives in when you build these." Combining the databases could offer the side benefit of making the system easier to manage because all data will exist in a single data model running on one Teradata platform, and because some redundant data can be culled.
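
Market-basket analysis itself is conceptually simple; a minimal sketch (illustrative products, not Sears' data or tooling) just counts how often items show up in the same transaction:

```python
# Minimal market-basket sketch: count how often pairs of products are
# purchased together -- the raw input for affinity and cross-sell analysis.
from collections import Counter
from itertools import combinations

transactions = [
    {"drill", "drill bits", "work gloves"},
    {"drill", "drill bits"},
    {"paint", "brushes", "work gloves"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))  # ('drill', 'drill bits') leads with 2
```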

There's a simple maxim that explains the explosive growth of operational databases and data warehouses for decision support: Businesses find increasingly sophisticated ways to use large volumes of data as the cost of storing and managing that data drops. Ten years ago, the hardware and software to store and manage 1 Mbyte of data cost about $15, says Stephen Brobst, chief technology officer for NCR Corp.'s Teradata division, whose database software underlies some of the largest data warehouses. Today that figure is down to around 15 cents per megabyte, he says--and it will drop to 1 cent per megabyte by 2005, predicts James Rothnie, chief technology officer at storage-system vendor EMC Corp. That's largely because of rapid advances in disk technology that let companies store ever-larger amounts of data online, immediately accessible to users, rather than offload it to tape. "It's just a question of which customer is going to step up first with a big-enough check," Brobst says.

At today's prices, the cost of buying a petabyte database and its attendant hardware, software, and personnel is probably too big for a private-sector company: $500 million to $750 million, estimates Richard Winter, an industry analyst who specializes in large databases. The return on investment just isn't there. But as storage prices drop because of improved disk-drive technology, and the cost of a petabyte database drops into the $100 million to $250 million range, some large companies will surely take the plunge.
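
A rough calculation using the per-megabyte prices quoted above shows why, counting only raw storage and ignoring servers, database software, and staff (decimal units, with 1 petabyte taken as 1 billion megabytes):

```python
# Back-of-the-envelope storage cost for one petabyte at the quoted prices.
MB_PER_PETABYTE = 1_000_000_000   # decimal: 1 PB = 1 billion MB

for year, cost_per_mb in [("1992", 15.00), ("2002", 0.15), ("2005 forecast", 0.01)]:
    print(f"{year}: ${MB_PER_PETABYTE * cost_per_mb:,.0f} for 1 petabyte of raw storage")
# 2002 works out to roughly $150 million before servers, software, and staff.
```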

But if businesses are willing, is existing technology able to support petabyte-scale databases effectively? The technology vendors, not surprisingly, say it is. "To a large degree, I think today's technology can do it," says Ken Jacobs, Oracle's VP of server technologies product strategy. "I don't see any fundamental impediments." But customers aren't so sure. "Every vendor thinks they can handle this," says Comscore VP Elert, who's skeptical that today's technology can ramp up to a petabyte. Still, he admits, the claims are hard to verify because doing so would require building a huge, expensive test system.

One of the tricky management problems for huge databases is that the software tools for database management, data loading, and query and reporting may not be able to handle the huge volumes of data in a petabyte database, IT managers say. "We've tried to do things with tools whose vendors claimed they could handle very large volumes of data and been burned," says SBC's Browne, who declines to cite specific products.

Even if vendors did have the technology to support a petabyte of data, they certainly lack the experience, says Winter, who publishes a well-known list of the largest private databases. "Whether they're really prepared to support such an operation remains to be seen," he says. "It's likely there's some pioneering to be done there." ACNielsen's data warehouse, for example, requires multiple servers presenting a single database image. That challenge prompted Sybase and Sun Microsystems last year to create the iForce data warehouse reference architecture to handle more than 25 terabytes of raw data. The iForce blueprint gives companies a template for assembling their own large data warehouses.

Most IT managers aren't ready, either. They'll need to make changes in their day-to-day procedures to work with supersize databases. "IT leaders and CIOs need to understand that they can't manage very large databases as they do older, smaller databases," says Dan McNicholl, CIO of General Motors North America in Detroit. The GM division operates a number of large databases, including its engineering and product-development database, with 8 terabytes of raw data and 22 terabytes of disk space.

"What works on the terabyte side fails miserably as you get close to a petabyte," says Rothnie at EMC, which has storage systems that support huge data warehouses, including the one at Sears. In October, EMC debuted its AutoIS series of products for automating complex data-storage management tasks. Routine jobs such as database backup and recovery, data loading, and batch-job scheduling become extraordinarily more complex with petabyte-scale databases, so automating as many of the processes as possible is critical.

But automation isn't enough. Operating huge databases also requires practices that some IT shops aren't used to. Comscore, for example, stores 9 terabytes of clickstream data on 27 terabytes of disk space that's partitioned into two sections. One section holds aggregated data and the other stores detailed data. Both are regularly backed up on tape, a process that takes several days, but only the aggregated data is backed up within the database. That's because the aggregated data has undergone more processing and would take more work to reconstruct if lost. "You have to make some real heartfelt decisions about what to back up," Elert says.
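
The underlying trade-off can be stated as a simple rule of thumb, sketched here with made-up numbers rather than Comscore's actual costs:

```python
# Illustrative backup-triage rule: protect a data set inside the database
# only when rebuilding it from raw inputs would cost more than backing it up.
def back_up_in_database(rebuild_hours, backup_hours):
    return rebuild_hours > backup_hours

print(back_up_in_database(rebuild_hours=200, backup_hours=48))  # aggregated data: True
print(back_up_in_database(rebuild_hours=24, backup_hours=72))   # raw detail: False
```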

With larger data volumes, programmers likely will have to rewrite algorithms and programs that perform routine tasks, such as searching file names. Database queries must be more tightly controlled to conserve processing resources. ACNielsen CIO Ross warns that IT managers must carefully think through database-segmentation schemes to ensure that no single server assumes a disproportionate share of the work.
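
Ross's segmentation point comes down to picking a partition key that hashes evenly across servers. A minimal sketch of hash partitioning (illustrative, not ACNielsen's actual scheme):

```python
# Minimal hash-partitioning sketch: a well-chosen partition key spreads
# rows evenly so no single server takes a disproportionate share of queries.
import hashlib
from collections import Counter

NUM_SERVERS = 6

def server_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SERVERS

load = Counter(server_for(f"household-{i}") for i in range(60_000))
print(load)  # roughly 10,000 keys per server when the hash distributes well
```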

For the folks at the Stanford Linear Accelerator Center, designing the database's complex schema (which defines the data structures) was a major chore. "It was a long process to get it right," Becla says. "The challenge was to get the scalability and performance we needed."

Balancing workloads among dozens, even hundreds, of servers supporting a distributed very large database will present significant programming and system-configuration challenges because of the volumes of data and the number of servers involved. In November, Florida International University in Miami went live with a 20-terabyte database of high-resolution aerial and satellite images of the entire United States provided by the U.S. Geological Survey. The system is believed to be the largest publicly accessible database on the Internet. But data for individual images is distributed to avoid overloading any single server in the event of a spike in demand for a particular image--say, as the result of a natural disaster. To maintain the database's performance, the school is creating a hierarchy of data caches to hold more frequently accessed images. The caches should reduce the primary database's workload.
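
Here is a sketch of the cache-then-primary lookup such a hierarchy implies, with plain dictionaries standing in for cache servers (hypothetical code, not FIU's implementation):

```python
# Hypothetical cache-hierarchy lookup: try progressively larger caches
# before falling back to the primary image database, then repopulate them.
def fetch_tile(tile_id, caches, primary_db):
    for cache in caches:               # e.g. regional cache, then central cache
        if tile_id in cache:
            return cache[tile_id]
    tile = primary_db[tile_id]         # last resort: the primary database
    for cache in caches:
        cache[tile_id] = tile          # warm the caches for the next request
    return tile

regional, central = {}, {}
primary = {"miami-2001": b"...image bytes..."}
fetch_tile("miami-2001", [regional, central], primary)
print("miami-2001" in regional)        # True: cached after the first fetch
```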

Databases operating across hundreds or thousands of processors will require sophisticated clustering. Oracle says its Real Application Clusters software addresses scalability concerns, although customers, for the most part, are running the software in trial mode rather than in full production. "It's looking promising," says Jamie Shiers, a database group leader at CERN.

Any system built on a monolithic architecture, as opposed to a distributed one, will run out of headroom long before reaching a petabyte. That's why IT managers who can imagine a petabyte in their future should already be running their databases on highly scalable systems, such as massively parallel processing hardware like NCR's WorldMark or IBM's eServer xSeries, to which processor nodes can be added as the system expands. "If you start with a base architecture that isn't scalable, then you're going to experience massive rework and redesign at massive costs," SBC's Browne says. SBC runs its Teradata data warehouse on a 176-node NCR Unix server. CERN also is experimenting with low-cost clustered nodes of Intel IA-32 processors running Linux.

To boost scalability for its current data warehouse, ACNielsen runs multiple servers with a complete copy of its media ratings database on each server. But CIO Ross says that approach isn't cost-effective with larger volumes of data. "When you're working with 10 terabytes of data, that's a very expensive proposition," he says. The new data warehouse will be structured as a single database distributed across six Sun servers, yet be managed to present a single image to users and database administrators.
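
The economics Ross describes are easy to see in rough numbers (illustrative figures and overhead factor, not ACNielsen's actual costs):

```python
# Rough comparison: full replication stores a complete copy per server,
# while a partitioned single-image database stores each row once plus
# an assumed overhead factor for indexes and redundancy.
data_tb = 10
servers = 6
overhead = 1.2                         # assumed factor for indexes/safety copies

print("full replicas:", data_tb * servers, "TB of disk")    # 60 TB
print("partitioned:  ", data_tb * overhead, "TB of disk")   # ~12 TB
```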

Another potential problem is moving data into and out of huge databases. At such volumes, the input/output buses in servers and storage devices can become performance bottlenecks. The InfiniBand architecture, a high-speed interconnect being developed by an association of more than 200 vendors, will allow point-to-point data communications among servers, networks, and storage devices. Some vendors plan to ship InfiniBand-ready products later this year.

The consensus among industry analysts, database-software vendors, and IT managers is that petabyte-scale databases in the commercial arena, measured by total disk space, will make their debut sometime in 2003 or 2004. A database with a petabyte of raw data is further off--about five years away, says Tim Donar, a senior system architect at Acxiom Corp., which builds and operates data warehouses for large clients. Although Acxiom manages about 200 terabytes of raw data for its customers, only 15 to 20 of its databases exceed 1 terabyte, and the largest holds 6 terabytes.

The typical American consumer now generates some 100 Gbytes of data during his or her lifetime, including medical, educational, insurance, and credit-history data, EMC's Rothnie says. Multiply that by 100 million consumers and you get a whopping 10,000 petabytes of data. A petabyte of data may seem like a lot to swallow today, but businesses' appetite for information shows no signs of diminishing.
