
May 8, 2000


Web Data Piles Up

Dot-com companies are collecting more and more clickstream data. It's chock-full of valuable customer information--but adding up so fast that database management has become a high priority.

By Rick Whiting

Illustration by Robert Nuebecker
    Accumulating and analyzing all the mouse clicks on the Web is a one-to-one marketer's dream come true. But it's a data-management nightmare for companies ill-prepared to handle the massive influx of information. User clickstream data--the electronic footprints that show where people go on the Web, what they do or buy, and when they return--is accumulating so fast at some sites that it's testing the limits of conventional approaches to database management. The megadatabases emerging from all this traffic are the centerpieces of Web-site infrastructures--and the jealously guarded assets--of the companies that manage to control them.

    With 48 million visitors in March, Yahoo was the busiest site on the Web. The portal's database infrastructure affects "how we deliver services--how effectively and economically we deliver services," says Geoff Ralston, VP and general manager of Yahoo's communications group, which manages the company's E-mail, messaging, and online chat operations. "One of the keys to success in this business is being able to scale. Much of the technical know-how in Yahoo has been focused on getting scalability right."

    Ralston calls the central database of customer information that supports Yahoo's ability to provide universal logon for all of its services a "crown jewel," though he refuses to talk about it, or any of the multitude of databases the company employs, in any detail. "They're not only mission-critical," Ralston says, "in many cases, they're a competitive advantage."

    Web databases can quickly grow into the terabyte range, a size that only the largest brick-and-mortar businesses have reached--and only then after years of data collection. Michael Howard, Oracle's VP of data warehousing, estimates that, driven by dot-coms, there's been a 30% increase in the number of companies looking to build 5-, 10-, or even 15-terabyte data warehouses. Richard Winter, an expert in very large databases, or VLDBs, and president of consulting firm Winter Corp., says the trend is toward clickstream databases that are hundreds of terabytes in size. "E-commerce is giving rise to a new generation of much larger, faster-growing databases," Winter says. "There's very little experience with managing databases of this scale."

    This trend is forcing Web database administrators to rewrite the book on database design, storage, backup, and archiving. The issues they're dealing with include database size, rate of growth, storage, when and how data should be summarized and compressed, and which indexing or data-organization techniques best support the need for fast answers when queried. "I'm used to big stuff," says Terry Jones, president and CEO of Travelocity.com LP. "But the problem here is, it's not only large, it's growing fast, and there aren't any road maps to tell us what to do."

    As with most business matters, everything starts with money. The cost of building a major database system adds up quickly and can easily account for a sizable share of a Web venture's budget. "The really big systems that are going to handle terabytes of clickstream data are in the tens of millions of dollars," Winter says. For example, an NCR Corp. WorldMark server running NCR's Teradata database--the workhorse platform that powers some of the largest commercial data warehouses--would cost about $13 million with 11 terabytes of disk space, Winter says. Factor in ongoing expenses such as maintenance and building applications, and the total cost can climb to two to five times the initial outlay.
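    As a rough illustration of that arithmetic, the sketch below applies the two-to-five-times range to the $13 million system price Winter cites. The function name, and the reading of the multiple as total cost relative to the initial purchase price, are assumptions made for this example rather than anything Winter specifies.

```python
def total_cost_range(initial_outlay, low_multiple=2, high_multiple=5):
    """Return (low, high) total cost, assuming total = multiple * initial outlay."""
    return initial_outlay * low_multiple, initial_outlay * high_multiple


# $13 million is the example system price quoted in the article.
low, high = total_cost_range(13_000_000)
print(f"Estimated total cost: ${low / 1e6:.0f} million to ${high / 1e6:.0f} million")
```

    Under that reading, the example system would end up costing somewhere between $26 million and $65 million once maintenance and application development are included.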

    Then there's the question of know-how. To manage all the data, Web startups are turning to the only source available: tech professionals with hands-on experience managing the biggest databases at brick-and-mortar businesses. Before taking the reins as president and CEO of Travelocity.com, a Sabre Inc. subsidiary, Jones was CIO of Sabre's travel-reservations unit, which runs the monumental Sabre reservation system. Amazon.com Inc. got its CIO, Rick Dalzell, from Wal-Mart Stores Inc., where Dalzell helped manage one of the world's largest data warehouses. And before joining Engage Technologies Inc., a subsidiary of Internet holding company CMGI Inc., as its chief technology officer--where he's helped create a Web database for online marketing--Daniel Jaye was a VLDB expert with Fidelity Investments.

    What can businesses do to prepare? The busiest Web sites provide the best clues because they're dealing with the problem now. Yahoo, for example, doesn't even use a relational database-management system such as Oracle8i or IBM's DB2 Universal Database for the massive system--it's "tens of terabytes," says Ralston--that supports its E-mail service. Rather, all that E-mail is managed directly by network-attached storage systems from Network Appliance Inc. "Storage is becoming really strategic and core to many things people are doing," Ralston says. "We're coming to the conclusion that storage is absolutely critical to us."

    In general, relational databases are better at managing and organizing data than storage systems. And while storage area networks excel at handling big workloads that are spread across multiple servers, they also lack some of the data-management controls of relational databases. For these reasons, many Web environments use all three technologies.

    To handle Web data, storage systems should be voluminous and reliable, consultant Winter says, and they must have flexible architectures that can be reconfigured to accommodate dynamic clickstream data. As the amount of Web data increases, SANs must have the kind of automated capabilities provided by EMC Corp. Without built-in intelligence, Winter says, managing Web data in this way "would quickly become a nightmare."

    One advantage dot-coms have over established businesses is that it's easier for them to make these kinds of platform decisions because they're starting with a clean slate. Rule No. 1 in dot-com land is to plan for rapid growth--and that goes double for database systems. WinWin.com in Boston collects market data from consumers (who remain anonymous) in return for cash incentives, then provides the information to advertisers. WinWin.com went live last month with 2.4 terabytes of storage disk space from EMC; plans are already in place to ramp up to 4 terabytes by early next year. About half the disk space is used for so-called raw data--the valuable information that comes in over the Web--with the rest devoted to data mirroring and other data-management operations.
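    The capacity figures above can be worked through in a few lines. This is a minimal sketch: the 2.4- and 4-terabyte totals and the "about half" raw-data share come from the article, while the function name and the assumption that the same split holds after the expansion are illustrative only.

```python
def raw_capacity_tb(total_tb, raw_fraction=0.5):
    """Disk space left for raw data when the rest goes to mirroring and management."""
    return total_tb * raw_fraction


launch_tb = 2.4   # total disk space at launch, from the article
planned_tb = 4.0  # planned total by early next year, from the article

print(f"Raw data at launch: {raw_capacity_tb(launch_tb):.1f} TB")
print(f"Raw data after expansion: {raw_capacity_tb(planned_tb):.1f} TB (assumes the same split)")
print(f"Planned growth in total capacity: {(planned_tb / launch_tb - 1) * 100:.0f}%")
```

    The point of the exercise is that mirroring and management overhead roughly doubles the disk a site has to buy for each terabyte of raw data it expects to keep.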

    WinWin.com projected its storage requirements and database growth when selecting the components of its IT system. "With a database this big, performance becomes an issue," says chief product officer Josh Motto, noting that he expects WinWin.com's database transaction volume to reach as high as 1.2 million transactions per second. To be ready, the company has deployed an Oracle8i database running on Sun Microsystems' high-end E10000 servers.

    Advance planning is helpful, but only to a point. Engage split its database operations between two primary platforms: a transactional system for building and maintaining user profiles and a data warehouse for analyzing those profiles. Each system runs Informix's relational database on Sun servers. But Engage is growing through acquisition, adding new platforms to the mix. The company's I-Pro Web-site management and analysis subsidiary processes 35 billion Web log files and generates more than 100,000 reports each month on an Oracle8i database running on a Sun server. Engage's recently acquired Flycast advertising network has been able to get a performance boost by partitioning its application rather than its Oracle database.
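    The article does not say how Flycast partitions its application. As a generic illustration of the idea only, the sketch below routes each request to one of several application partitions using a stable hash of a key, so the split happens in the application tier rather than inside the database. The partition names, the key format, and the hashing scheme are all hypothetical.

```python
import zlib


def pick_partition(key, partition_count):
    """Map a key to a partition index using a stable CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


# Hypothetical partition names; nothing here reflects Flycast's real design.
PARTITIONS = ["app-0", "app-1", "app-2", "app-3"]


def route(key):
    """Return the application partition that should handle this key."""
    return PARTITIONS[pick_partition(key, len(PARTITIONS))]


print(route("campaign-1001"))
```

    Because the hash is stable, the same key always lands on the same partition, which lets each partition work on its own slice of the traffic.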


    Photo of Jones by Alan Blaustein
