Just a few years ago, a data warehouse or transactional database that approached a terabyte was considered big. Today, "big" means tens of terabytes. Here's the story behind four of the largest data systems in the world, plus a government database project expected to reach up to 5 petabytes (or 5,000 terabytes) within several years and up to 50 petabytes in 20 years. All are examples of organizations pushing the edge of what's possible with database technologies.
With its 26-terabyte data warehouse, AT&T Labs can detect fraudulent use of calling cards and investigate calls related to kidnappings and other crimes. It also can tally millions of call-in votes from TV viewers picking the next American Idol.
Those are some of the more exciting roles for a data warehouse that holds two years of detailed records of long-distance and local phone calls that traverse the AT&T network. But the 3,000 AT&T employees who tap into the system mostly use it for far more routine chores, such as analyzing call volumes to plan network expansions and upgrades, checking for billing errors, and calculating prices to pitch new services to customers.
The warehouse, split between data centers in two undisclosed locations, holds data that would amount to 96 terabytes uncompressed, says Sandy Hall, who heads the customer- and service-management department at AT&T. Compression brings that down to 26.3 terabytes, which still makes it one of the largest data warehouses in the world, according to Winter Corp., a research firm that tracks decision-support and online transactional-processing databases worldwide.
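For a sense of scale, the two figures Hall cites imply a compression ratio of roughly 3.65 to 1. The short Python sketch below only works through that arithmetic; the function names are illustrative, and it assumes the 96-terabyte figure describes the same data before compression.

```python
# Figures quoted in the article (terabytes).
UNCOMPRESSED_TB = 96.0
COMPRESSED_TB = 26.3


def compression_ratio(uncompressed: float, compressed: float) -> float:
    """How many units of raw data fit into one unit of stored data."""
    return uncompressed / compressed


def space_saved_fraction(uncompressed: float, compressed: float) -> float:
    """Fraction of the raw size that compression eliminates."""
    return 1.0 - compressed / uncompressed


ratio = compression_ratio(UNCOMPRESSED_TB, COMPRESSED_TB)
saved = space_saved_fraction(UNCOMPRESSED_TB, COMPRESSED_TB)
print(f"ratio: {ratio:.2f}:1, space saved: {saved:.1%}")
# prints: ratio: 3.65:1, space saved: 72.6%
```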
Data enters the data warehouse in near real time directly from AT&T's operational-billing and network-management systems, letting it provide answers to queries almost instantly. The company can compile and analyze consumer calls in response to AT&T television ads an hour after they run. Before the data warehouse was built in 1997, AT&T marketing personnel had to wait four to six weeks for billing reports to analyze whether an ad translated to sales. Now an analyst can query the system for all calls made to a country from a specific area code during a specific month and get an answer within a minute, Hall says. Data from additional sources such as credit bureaus is added to the mix to help with the analysis.
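The kind of query Hall describes (calls to one country, from one area code, in one month) can be sketched in a few lines. The example below is purely illustrative: it uses Python's built-in sqlite3 module rather than AT&T's own system, and the table name, column names, and sample rows are invented for the demonstration, not taken from the actual warehouse.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory call-record table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls ("
        "caller_area_code TEXT, dest_country TEXT, "
        "started_at TEXT, duration_s INTEGER)"
    )
    rows = [
        ("212", "FR", "2004-03-02T09:15:00", 340),
        ("212", "FR", "2004-03-28T22:01:00", 75),
        ("212", "DE", "2004-03-11T13:40:00", 120),
        ("415", "FR", "2004-03-05T08:00:00", 600),
        ("212", "FR", "2004-04-01T00:00:00", 30),
    ]
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", rows)
    return conn


def month_bounds(year: int, month: int) -> tuple[str, str]:
    """Return ISO timestamps for the start of the month and of the next one."""
    start = f"{year:04d}-{month:02d}-01T00:00:00"
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = f"{next_year:04d}-{next_month:02d}-01T00:00:00"
    return start, end


def calls_to_country(conn, area_code, country, year, month):
    """All calls from one area code to one country within one month."""
    start, end = month_bounds(year, month)
    cur = conn.execute(
        "SELECT started_at, duration_s FROM calls "
        "WHERE caller_area_code = ? AND dest_country = ? "
        "AND started_at >= ? AND started_at < ? "
        "ORDER BY started_at",
        (area_code, country, start, end),
    )
    return cur.fetchall()


conn = make_demo_db()
march_calls = calls_to_country(conn, "212", "FR", 2004, 3)
print(len(march_calls))  # prints: 2
```

The half-open date range (on or after the first of the month, before the first of the next) keeps a call placed at midnight on April 1 out of the March results.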
The system was built with two weeks of call-record data and reached two years of data by 2002. Data older than two years is stored in an offline archive, but Hall's team is studying whether those records might prove useful enough to keep in the data warehouse. AT&T's marketing folks, in particular, have an insatiable appetite for the data. "They want more, more, more," Hall says.
AT&T strictly controls how call records can be used. While the company's customer information can be tapped for marketing campaigns, data about non-AT&T customers whose calls cross the AT&T network can't. Plus, data is partitioned so that employees can see only the data relevant to their jobs. All employee access to data is logged and audited so anything improper can be traced back to an individual. "Having lots of data gives you lots of power, but you have to make sure it's used responsibly," Hall says.
To provide a foundation for the data warehouse, AT&T internally developed a database, which it calls Daytona, to get the performance it needed. "We found we couldn't load a day's worth of data in a day, let alone an hour's worth of data in an hour," Hall says of the commercial databases the company considered. "We operate at a level of scale that other products can't match."
The data warehouse runs on Sun Microsystems Enterprise 10000 servers, uses 2,670 disk drives for storage, and is coupled with technology that AT&T developed to analyze data for fraud and other problems as it's loaded. In addition to keeping copies of the database at its two data centers, AT&T streams data to tape at the same time it's loaded into the warehouse.
The biggest challenge? Making sure the system is always up and running, Hall says. AT&T's policy is that the data warehouse must always be available, so maintenance is handled piecemeal.
-- Rick Whiting