May 8, 2000
http://www.informationweek.com/785/database.htm
Web Data Piles Up
Dot-com companies are collecting more and more clickstream data. It's chock-full of valuable customer information--but adding up so fast that database management has become a high priority.
By Rick Whiting
ccumulating and analyzing all the mouse clicks on the Web is a one-to-one marketer's dream come true. But it's a data-management nightmare for companies ill-prepared to handle the massive influx of information. User clickstream data--the electronic footprints that show where people go on the Web, what they do or buy, and when they return--is accumulating so fast at some sites that it's testing the limits of conventional approaches to database management. The megadatabases emerging from all this traffic are the centerpieces of Web-site infrastructures--and the jealously guarded assets--of the companies that manage to control them.With 48 million visitors in March, Yahoo was the busiest site on the Web. The portal's database infrastructure affects "how we deliver services--how effectively and economically we deliver services," says Geoff Ralston, VP and general manager of Yahoo's communications group, which manages the company's E-mail, messaging, and online chat operations. "One of the keys to success in this business is being able to scale. Much of the technical know-how in Yahoo has been focused on getting scalability right."
Ralston calls the central database of customer information that supports Yahoo's ability to provide universal logon for all of its services a "crown jewel," though he refuses to talk about it, or any of the multitude of databases the company employs, in any detail. "They're not only mission-critical," Ralston says, "in many cases, they're a competitive advantage."
Web databases can quickly grow into the terabyte range, a size that only the largest brick-and-mortar businesses have reached--and only then after years of data collection. Michael Howard, Oracle's VP of data warehousing, estimates that, driven by dot-coms, there's been a 30% increase in the number of companies looking to build 5-, 10-, or even 15-terabyte data warehouses. Richard Winter, an expert in very large databases, or VLDBs, and president of consulting firm Winter Corp., says the trend is toward clickstream databases that are hundreds of terabytes in size. "E-commerce is giving rise to a new generation of much larger, faster-growing databases," Winter says. "There's very little experience with managing databases of this scale."
This trend is forcing Web database administrators to rewrite the book on database design, storage, backup, and archiving. The issues they're dealing with include database size, rate of growth, storage, when and how data should be summarized and compressed, and which indexing or data-organization techniques best support the need for fast answers when queried. "I'm used to big stuff," says Terry Jones, president and CEO of Travelocity.com LP. "But the problem here is, it's not only large, it's growing fast, and there aren't any road maps to tell us what to do."
As with most business matters, everything starts with money. The cost of building a major database system adds up quickly and can easily account for a sizable share of a Web venture's budget. "The really big systems that are going to handle terabytes of clickstream data are in the tens of millions of dollars," Winter says. For example, an NCR Corp. WorldMark server running NCR's Teradata database--the workhorse platform that powers some of the largest commercial data warehouses--would cost about $13 million with 11 terabytes of disk space, Winter says. Factor in ongoing expenses such as maintenance and building applications, and the total cost can jump two to five times the initial layout.
Then there's the question of know-how. To manage all the data, Web startups are turning to the only source available: tech professionals with hands-on experience managing the biggest databases at brick-and-mortar businesses. Before taking the reins as president and CEO of Travelocity.com, a Sabre Inc. subsidiary, Jones was CIO of Sabre's travel-reservations unit, which runs the monumental Sabre reservation system. Amazon.com Inc. got its CIO, Rick Dalzell, from Wal-Mart Stores Inc., where Dalzell helped manage one of the world's largest data warehouses. And before joining Engage Technologies Inc., a subsidiary of Internet holding company CMGI Inc., as its chief technology officer--where he's helped create a Web database for online marketing--Daniel Jaye was a VLDB expert with Fidelity Investments
What can businesses do to prepare? The busiest Web sites provide the best clues because they're dealing with the problem now. Yahoo, for example, doesn't even use a relational database-management system such as Oracle8i or IBM's DB2 Universal Database for the massive system--it's "tens of terabytes," says Ralston--that supports its E-mail service. Rather, all that E-mail is managed directly by network-attached storage systems from Network Appliance Inc. "Storage is becoming really strategic and core to many things people are doing," Ralston says. "We're coming to the conclusion that storage is absolutely critical to us."

In general, relational databases are better at managing and organizing data than storage systems. And while storage area networks excel at handling big workloads that are spread across multiple servers, they also lack some of the data-management controls of relational databases. For these reasons, many Web environments use all three technologies.
To handle Web data, storage systems should be voluminous and reliable, consultant Winter says, and they must have flexible architectures that can be reconfigured to accommodate dynamic clickstream data. As the amount of Web data increases, SANs must have the kind of automated capabilities provided by EMC Corp. Without built-in intelligence, Winter says, managing Web data in this way "would quickly become a nightmare."
One advantage dot-coms have over established businesses is that it's easier for them to make these kinds of platform decisions because they're starting with a clean slate. Rule No. 1 in dot-com land is to plan for rapid growth--and that goes double for database systems. WinWin.com in Boston collects market data from consumers (who remain anonymous) in return for cash incentives, then provides the information to advertisers. WinWin.com went live last month with 2.4 terabytes of storage disk space from EMC; plans are already in place to ramp up to 4 terabytes by early next year. About half the disk space is used for so-called raw data--the valuable information that comes in over the Web--with the rest devoted to data mirroring and other data-management operations.
WinWin.com projected its storage requirements and database growth when selecting the components of its IT system. "With a database this big, performance becomes an issue," says chief product officer Josh Motto, noting that he expects WinWin.com's database transaction volume to reach as high as 1.2 million transactions per second. To be ready, the company has deployed an Oracle8i database running on Sun Microsystems' high-end E10000 servers.
Advance planning is helpful, but only to a point. Engage split its database operations among two primary platforms: a transactional system for building and maintaining user profiles and a data warehouse for analyzing those profiles. Each system is a combination of Informix's relational database running on Sun servers. But Engage is growing through acquisition, adding new platforms to the mix. The company's I-Pro Web-site management and analysis subsidiary processes 35 billion Web log files and generates more than 100,000 reports each month on an Oracle8i database running on a Sun server. Engage's recently acquired Flycast advertising network has been able to get a performance boost by partitioning its application rather than its Oracle database.
"We certainly haven't found a one-size-fits-all database," CTO Jaye says. Engage is looking at special-purpose database technologies that get around some of the limitations of relational databases. Main-memory databases, for example, process data in a computer system's memory disk space, while multidimensional databases organize data into interrelated hierarchies. Both are fast at problem solving. "We're at the point now of seeing where we can apply some of these novel solutions to certain high-value business problems."
Managing the rate of database growth is a challenge, especially at companies such as DoubleClick Inc. that are constantly expanding their services. DoubleClick, which sells and manages online advertisements for thousands of Web sites, collects 250 Gbytes of clickstream data from its 500-plus advertisement servers and 125 media servers every day. That's up from about 30 Gbytes per day just a year ago. "We figure that by the end of this year we'll be pulling in about a terabyte of data every day," says Bob Linsky, VP of MIS and operations.

"The real challenge, as the applications grow quickly, is being sure that the database's table structures are maintained and don't grow exponentially," Linsky says. That means being vigilant when it comes to adding new naming conventions and element definitions to the database. For example, as DoubleClick adds services, the tendency among the company's 175 application developers is to create new definitions of "customer" for those applications, rather than use a definition already built into the system. That can result in multiple definitions of "customer" in the database, which causes it to grow exponentially. "Reuse rather than re-create" is Linsky's motto.
How to add new information to a database without compromising the integrity of the existing data is a problem all database administrators face. But it's particularly thorny with VLDBs, given their complexity. WinWin.com's Motto says the trick is to design the database with as flexible an architecture as possible. "What we tried to do was anticipate that expansion strategy to the extremes," he says. The database has one main table; a database with lots of small tables is easier to design, but it also increases complexity and slows performance. In addition, stored procedures and triggers were designed to handle a greater range of variables, rather than specific queries, making the database better able to handle different queries.
As databases expand, E-businesses are also wrestling with the question of how much detailed, granular data they need to keep and how much can be stored in a summarized format. Unfortunately for database managers, demands are increasing for keeping every last mouse-click of clickstream and transaction data.
Travelocity frequently taps a database of information on its 19 million customers to develop special travel offers that are E-mailed to segments of the company's customer base. "Our goal, of course, is to convert lookers into bookers," Jones says. For example, Trans World Airlines recently began flights from Los Angeles to San Juan, Puerto Rico, Jones says, so Travelocity analysts searched the company's data warehouse to identify users who had looked at San Juan-related content on Travelocity's Web site and sent them a promotional message.
Keeping detailed data can also be useful for resolving problems such as one Travelocity encountered last year. The company began receiving complaints from customers that airlines, cruise lines, and hotels never received their online reservations. After studying the data from individual customers, Travelocity analysts concluded that after filling out online reservation forms, some customers were closing the Web page rather than clicking the final button that sends the forms on their way. Making that discovery would have been impossible using summarized data. The company corrected the problem by making changes in the design of its Web pages.
For one-to-one marketing applications, detailed granular data is a necessity. "When you're doing personalization, detail is extremely important," Oracle's Howard says. Other data warehouse applications such as fraud analysis also require detailed data.
In the past, summarized data in data warehouses was sufficient for discerning broad market trends. But that's changing as data-mining tools, which require detailed data to be effective, become more popular. "You really can't do data mining with summarized data," says Vicki Farrell, assistant VP of marketing for NCR's Teradata software.
MatchLogic Inc., which provides Web advertising and online marketing services for dot-com startups and big-name companies such as AT&T, General Motors, and Procter & Gamble, gives its clients both granular and summarized data, depending on their needs, says Jack Garzella, senior director of E-business systems. Detailed data may be needed, for example, when clients' customers ask how they got onto an E-mail marketing list or want to "opt out." That requires looking back at the data to determine how and when an individual got on the list, perhaps by registering at a Web site or making a purchase. Managing the frequency with which individuals are targeted for marketing campaigns to ensure potential customers aren't bombarded also requires detailed data.
DoubleClick is able to avoid that effort because granular data isn't needed for the level of reporting the company provides its customers. DoubleClick's database is a manageable 300 Gbytes to 400 Gbytes, Linsky says. But he expects it to grow to about 1 terabyte by year's end. Linsky's operation also maintains a number of smaller databases--subsets of the primary database--that are used for special services DoubleClick offers to publishers and advertising agencies.
There's no easy answer to the question of how long companies need to keep data. "There's no set rule," Garzella says. The MatchLogic executive says the length of time that companies keep detailed data may hinge on the specific industry and its sales cycles. An auto dealer might keep detailed data going back three to five years, for example, while a year's worth of data might be enough for a company that provides day-trading services.
MatchLogic keeps every last bit of data it collects because it never knows for sure what kind of information its clients might need. MatchLogic collects clickstream data from the 1 billion-plus transactions it processes every day. Its IT systems are bulging with about 25 terabytes of data spread across 15 data warehouses. The largest warehouse is 1.7 terabytes.
Garzella says he and his staff have begun studying the question of expiration dates for data. "Is clickstream data worth anything if it's three years old?" he wonders. "I'd like to know, because it's not cheap to store all this data."
Backing up data, once a fairly straightforward chore, also becomes problematic with Web databases. Consultant Winter says that VLDBs require data storage systems with automated backup capabilities. It's a high priority for DoubleClick's Linsky, given that the collected data is essentially the company's principal product. DoubleClick conducts incremental backups on an ongoing basis, Linsky says, plus full weekly backups. The company uses tools from Legato Systems Inc. to back up the log files from its Windows NT-based advertisement servers and Oracle's replication technology to backup the database itself.
Web database administrators are also wrestling with the question of how data should be archived--and how much. Determining what clickstream and transaction data needs to be kept online for internal analysts and external customers and what can be stored in slower archival systems is more an art than a science. Oracle's Howard says many of the company's customers use what he calls a "rolling window operation" approach, always keeping a window of "live" data--such as 13 months worth for seasonal purposes--and then archiving older data.
But usage, rather than age, should be the gauge for judging when data should be archived, says Claudia Imhoff, a senior VP at Braun Consulting and an expert on databases and data warehouses. She advocates using tape or optical disk systems as an alternative to expensive disks for archiving data. IT managers "need to be analyzing what data is being used and how it's being used," she says. That means using a data usage-monitoring tool, she says, such as those sold by vendors such as Pine Cone Systems Inc. and Teleran Technologies LP.
For many companies, however, developing data-archiving plans remains a relatively low priority. "We'll get to an archiving strategy one day," says DoubleClick's Linsky, citing the end of the year as a goal.
Adding to the management complexities is the changing nature of how Web databases are being used. In the past, data warehouses were used primarily for offline analysis, and were far removed from operational IT systems. If they went down for an hour--or even a day or two--the world wouldn't end. With the rise of online sales and marketing, data warehouses are tied directly into E-commerce systems because companies want near-instantaneous feedback on operations, and because they can support personalization and cross-selling applications. "The data warehouse is no longer used primarily for deep analysis," Oracle's Howard says. "The data warehouse has become a production system unto itself."

Even for the few companies that are experienced at managing large Web databases, the task doesn't seem to get easier. Web-site infrastructure "is incredibly important and becoming more so," Ron Sege, executive VP with Lycos Inc., said in a recent public presentation. Lycos, a Web portal, has profiles on 42 million registered users, who generate searches on up to 30 terabytes of information each day. "It's not trivial," he said dryly.
Travelocity and MatchLogic both recently added NCR's Teradata platform to their IT operations when Oracle databases began to bog down under the load. Travelocity has been using Oracle databases for both its transactional systems and its data warehouse, with data moved from the former to the latter on a daily basis. The data warehouse was built with "indexes," predefined queries that are designed to boost system performance.
But that approach wasn't working because Travelocity's marketing analysts bombarded the data warehouse with a wide range of ad hoc queries that rendered the indexing scheme nearly useless. The ability to ask those previously undefined questions is key. "The people who are getting the real value out of data warehouses are asking ad hoc questions," NCR's Farrell says.
Teradata's parallel-processing architecture can rapidly process queries without the need for indexes. "That allows you to ask the right questions and beat the competition," Jones says. Another benefit was that Travelocity's data warehouse actually shrank when converted to Teradata because of the reduced use of summary tables. Travelocity runs its operational systems on servers from SGI Inc. and Sun Microsystems, while the Teradata data warehouse will run on NCR's Worldmark server.
Like Travelocity, MatchLogic's problem was how to maintain the performance of its data warehouses under an increasing number of simultaneous queries from clients and employees. MatchLogic has been using Oracle databases for both its operational and data-warehouse systems. Data is extracted from the Oracle database and prepared for analysis using tools from SAS Institute Inc. That approach works fine when, say, five or fewer people are trying to query the database simultaneously, Garzella says, but the system runs out of steam as the number of simultaneous queries hits 10 or 15. MatchLogic needed to boost that capacity to between 50 and 100 simultaneous users. "We're finding that for some clickstream data warehouses, we're running out of juice," Garzella says.
Using the Teradata system, MatchLogic will consolidate the number of data warehouses it operates from 15 to seven or eight. Garzella expects to get a performance boost from the parallel-processing capabilities of the data warehouse software's core engine. But equally important is the system's TeraMiner technology, which prepares data for queries by pre-processing the data while it's still within the data warehouse. "We're looking at improvements of 15 to 30 times in our clickstream query speeds," Garzella says. The Oracle databases run on Sun 4500 and 6500 servers, while the Teradata system will run on NCR's 5200 Series servers.
One of the early front-runners in using a Web database of customer information for strategic advantage was Peapod Inc., the online grocery-delivery service. The company built a proprietary database engine capable of recognizing repeat customers and offering discounts and promotions based on a customer's profile. Peapod had ambitious plans to exploit its database, but it struggled to turn a profit, and as cash reserves ran dangerously low in March, the company's CEO resigned and its stock plunged. Since then, the Dutch supermarket group Koninklijke Ahold NV has agreed to buy 51% of Peapod for $73 million.
As that example shows, Web databases aren't enough to protect an E-business from all that might go wrong. What's more, they're expensive and complex. But Web databases keep growing in size, and as they do, they're taking on greater importance at the companies building them. "They're the heart of Engage's value," says CTO Jaye. That's a sentiment sure to be repeated at more and more companies.
--with additional reporting by John Foley
Illustration by Robert Nuebecker
Photo of Jones by Alan Blaustein
Photo of Jaye by Shelly R. Harrison
Photo of Linsky by Edward Santalone
Cirrus Logic seeking Digital IC Design Engr in Austin, TX
Hebrew SeniorLife seeking Senior Network Analyst in Boston, MA
Agilent seeking NPI Project Manager in Shanghai, CN
UC Berkeley seeking Helpdesk Team Lead in Berkeley, CA
Rohm and Haas seeking Product Portfolio Manager in Philadelphia, PA
For more great jobs, career-related news, features and services, please visit our Career Center.