InformationWeek: The Business Value of Technology


InformationWeek.com March 12, 2001

Pan For Gold In The Clickstream

Companies face many challenges as they extend the functionality of their data-mining systems to analyze E-commerce data. It's worth the effort, though, because the business benefits of customer intelligence can be enormous.

By Herbert A. Edelstein   (herb@twocrows.com)

Companies venturing into E-commerce have a dream. By analyzing the tracks people make through their Web site, they'll be able to optimize its design to maximize sales. Information about customers and their purchasing habits will let companies initiate E-mail campaigns and other activities that result in sales. Good models of customers' preferences, needs, desires, and behaviors will let companies simulate the personal relationship that businesses and their clientele had in the good old days.

Illustration By Richard Downs

The foundation of this dream is the log of customer accesses maintained by Web servers. A sequence of page hits might look something like this: Page A => Page B => Page C => Page D => Page C => Page B => Page F => Page G. Or more explicitly: Login => Register => Product Description => Purchase.

By analyzing customer paths through the data, vendors hope to personalize the interactions that customers and prospects have with them. Companies will customize the home page each customer sees, the responses to requests, and the recommendations of items to purchase. If you're a customer of Amazon.com Inc., you may already have noticed such personalization. Vendors can also generate a list of related products.

The business benefits of this customer intelligence are potentially enormous. The number of people who come to a site and purchase will increase, and the average amount per purchase will rise, resulting in a dramatic increase in profitability--that's the dream, at least.

The reality is that achieving this goal is difficult and expensive--but it's not impossible. First, to be of any use at all, clickstream data requires enormous amounts of labor-intensive pre-processing. Even then, extracting meaning is still difficult. Second, many customers are reluctant to have vendors track what they do. Their concern is so great that the government is actively considering privacy regulation to limit Web tracking.

To understand the special challenges of clickstream analysis, let's examine the issues in the context of the data-mining process. Data mining uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make accurate predictions. It's through data mining that companies can build the most effective models of their customers and prospects.

The seven steps in this process are:

  • Define the business problem

  • Build data-mining database

  • Explore data

  • Prepare data for modeling

  • Build model

  • Evaluate model

  • Act on the results.

    When defining the business problem, you must be specific about what you want to accomplish. Typical goals might include improving the design of a Web site by identifying the paths people take to arrive at a purchase; detecting problems such as pages that are never accessed; and suggesting strategies for increasing market basket size or increasing the conversion rate (turning visitors into purchasers).

    Building the data-mining database, exploring the data, and preparing it for modeling are the most time-consuming. For clickstream data, these tasks are particularly arduous, consuming 80% to 95% of a project's time and resources.

    These are the key steps in building a data-mining database:

  • Integrate logs

  • Remove extraneous items from log

  • Identify users and sessions

  • Complete paths

  • Identify transactions

  • Integrate with other data.

    First, the raw clickstream data from Web-server access logs must be consolidated from multiple servers and turned into usable records. This is difficult because the pages visited by one individual are usually randomly buried in a mass of other pages, perhaps separated by hundreds or thousands of references to other pages. The common log format files or extended common log files are typically used by the Web server to keep track of the requests that occur at a Web site. The fields available in extended common log files include: host, date/time, request, status, bytes, referring page, and browser type. The actual record would look something like this:

    151.196.31.32 - - [11/Oct/2000:00:10:14 -0400] "GET /publictn.htm HTTP/1.0" 200 5896 "http://www.twocrows.com/index.htm" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
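As a minimal sketch of the first step, a record like this can be split into the fields listed above with a regular expression. The pattern and the names `LOG_PATTERN` and `parse_log_line` are illustrative assumptions, not part of any particular server or product, and the sample record is written with the usual single-space layout of the extended log format.

```python
import re
from datetime import datetime

# Illustrative pattern for one extended-format log record (an assumption,
# not taken from any specific Web server's documentation).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = ('151.196.31.32 - - [11/Oct/2000:00:10:14 -0400] '
          '"GET /publictn.htm HTTP/1.0" 200 5896 '
          '"http://www.twocrows.com/index.htm" '
          '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"')
record = parse_log_line(sample)
```

Returning `None` for lines that do not match, rather than raising an error, lets a consolidation job count and set aside malformed records from servers that log in their own native formats.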

    Further complicating the process is the fact that a single page request may generate multiple entries on different logs for different server types. There may be page servers, image servers, ad servers, application servers, or other types of servers, each logging actions, sometimes in its own native format. And a logical server may in fact be multiple physical servers for geographic and load-balancing reasons.

    Our goal is to take a sequence of log records like the one above and create a session of page views like this:

    A => B => C => D => C => B => F.

    Unfortunately, we're starting from a collection of raw log entries that look more like this:

    A, q, w, B, t, y, x, D, z, @, $, C, u, I, p, B, f, F, #, %, ^, G, k, l, v, &, =, }, t, r, b, l.

    The first step in cleaning up the log is to remove the extraneous pages (such as q, w, and @), which may be requests for image files or page requests from Web spiders gathering information for search engines or other non-user requests.
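A rough sketch of this cleanup step might look like the following. The suffix list and the robot markers are assumptions chosen for illustration; a real site would tune both against its own logs.

```python
# Hypothetical filter rules; both tuples are illustrative assumptions.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")
ROBOT_MARKERS = ("bot", "spider", "crawler")

def is_page_view(path, user_agent):
    """Keep requests for pages made by what appears to be a human visitor."""
    # Ignore any query string when checking the file suffix.
    if path.lower().split("?")[0].endswith(IMAGE_SUFFIXES):
        return False
    agent = user_agent.lower()
    return not any(marker in agent for marker in ROBOT_MARKERS)

browser = "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
requests = [
    ("/index.htm", browser),
    ("/logo.gif", browser),               # image request
    ("/index.htm", "ExampleSpider/1.0"),  # made-up spider name
    ("/publictn.htm", browser),
]
page_views = [path for path, agent in requests if is_page_view(path, agent)]
```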

    Next you have to identify the sessions contained in the data stream. This is much harder than it sounds, and is still a subject of research. The problem stems from the fact that the Web was designed to be stateless; that is, each request to a server starts as if nothing happened previously. Consequently, there's no intrinsic way to identify sessions, even if you know the user's identification.

    There are three primary ways to finesse this problem and identify sessions from Web access log data. The first approach is to use heuristics. IP addresses aren't enough to identify a customer because they're not unique to that person. Frequently, an IP address is assigned from a pool of addresses by an Internet service provider. It's sometimes said that half of the United States lives in Vienna, Va., because that's the home of America Online. To identify a session, you can try a combination of IP address, browser type, and pages viewed.
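One way to picture the heuristic approach is to group hits under a combined key. The sketch below uses only IP address and browser type; the function name and the sample data are made up for illustration, and the result is a guess at who is who, not a true identification.

```python
from collections import defaultdict

def group_by_visitor(hits):
    """Group (ip, user_agent, page) hits under a heuristic (ip, user_agent) key."""
    visitors = defaultdict(list)
    for ip, user_agent, page in hits:
        visitors[(ip, user_agent)].append(page)
    return dict(visitors)

hits = [
    ("10.0.0.1", "MSIE 4.01", "A"),
    ("10.0.0.1", "Netscape 4.7", "X"),  # same address, different browser
    ("10.0.0.1", "MSIE 4.01", "B"),
    ("10.0.0.2", "MSIE 4.01", "A"),
]
visitors = group_by_visitor(hits)
```

Two people behind the same ISP address who happen to run the same browser would still be merged under one key, which is why the pages viewed are often added as a further clue.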

    A second approach is to embed session identification numbers in the URL. This works well as long as the customer doesn't visit another site during the session. If that happens, the session ID is lost upon return and the customer will appear as a new customer.
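A minimal sketch of URL-embedded session IDs, using Python's standard `urllib.parse` module, follows. The parameter name `sid` and the example URL are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # hypothetical query-parameter name

def add_session_id(url, session_id):
    """Return the URL with the session ID carried in its query string."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != SESSION_PARAM]
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))

def get_session_id(url):
    """Return the session ID found in the URL, or None if there is none."""
    return dict(parse_qsl(urlsplit(url).query)).get(SESSION_PARAM)

tagged = add_session_id("http://www.example.com/catalog.htm?item=42", "abc123")
```

Every link the site emits has to be rewritten this way, so a visitor who returns through a bookmark or an outside link arrives with no ID, which is the weakness described above.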

    But by far the most common approach is to use cookies. A cookie is a text file placed on your computer that contains information about your session and what you did. Many customers don't like cookies, so they refuse to accept them or accept them only selectively. These surfers worry about being tracked or about having mysterious files residing in their computers.

    There are problems that remain even if you successfully identify a visitor to your site.

    You may have noticed in the hypothetical raw log string shown earlier that Page C is missing in the step from Page B to Page D. That's because not all the accesses to a Web site's pages are made on the Web site's servers. Requests for a page may be filled by a different server called the proxy server that sits between the customer and the home server on which the page resides. If a request to your site is filled by a proxy server, you have no information about it. There are estimates that over half of all page requests are filled by proxy servers.
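One commonly described heuristic for filling such gaps relies on the referring-page field: when a request's referrer is not the page the visitor was last seen on, the missing steps were presumably served from a cache. The sketch below is an assumption-laden illustration of that idea, not a method the article prescribes, and it can only guess at what the visitor really saw.

```python
def complete_path(logged):
    """Rebuild a session path from time-ordered (page, referrer) pairs."""
    path = []
    for page, referrer in logged:
        if path and referrer is not None and referrer != path[-1]:
            if referrer in path:
                # Visitor presumably pressed "back" through cached pages:
                # replay the pages between the last page and the referrer.
                last_seen = len(path) - 1 - path[::-1].index(referrer)
                path.extend(reversed(path[last_seen:-1]))
            else:
                # Referrer was never logged: presumably filled by a cache.
                path.append(referrer)
        path.append(page)
    return path

# Hypothetical logged requests: C was never seen by the server.
logged = [("A", None), ("B", "A"), ("D", "C"), ("F", "B")]
completed = complete_path(logged)
```

With this made-up input, the reconstructed path is A, B, C, D, C, B, F: the unseen visit to C is inferred from D's referrer, and the walk back to B is inferred from F's referrer.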

    There are two main types of proxy servers: cache servers and filter servers. Most people have used some kind of cache server in which pages are stored closer to the customer for improved performance. For example, when a customer hits the "back" button on a browser, the previous page may be retrieved from a local cache; or when he or she first accesses a popular page, it's actually being retrieved from a server at the ISP.

    One challenge is identifying the end of a session and the reason for its end. If a session culminates in a purchase transaction, that's relatively easy to identify. But what if it doesn't? The most common heuristic used to identify the end of a session is to set a fixed time after the last page access at a site, typically 30 minutes. But that doesn't tell you why the session ended. It may have ended because a page took too long to load and the customer went someplace else in frustration, or the customer had to answer the phone and returned to the session 31 minutes later. Clearly, these reasons have differing implications for the vendor.
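The timeout heuristic itself is simple to sketch. The function name and sample timestamps below are assumptions; the 30-minute value is the typical figure mentioned above.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_sessions(hits, timeout=SESSION_TIMEOUT):
    """Split one visitor's time-ordered (timestamp, page) hits into sessions."""
    sessions = []
    for timestamp, page in hits:
        # Start a new session when the gap since the previous hit exceeds the timeout.
        if not sessions or timestamp - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((timestamp, page))
    return sessions

start = datetime(2000, 10, 11, 0, 10, 14)
hits = [
    (start, "A"),
    (start + timedelta(minutes=5), "B"),
    (start + timedelta(minutes=36), "C"),  # 31 minutes after B
]
sessions = split_sessions(hits)
```

The visitor who answered the phone and came back 31 minutes later shows up here as two sessions, and nothing in the data says why the first one ended.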

    The next problem is that many pages are dynamically created, making it hard to know their content. For example, recommendation pages are assembled when the customer visits the vendor's site, so it may take some effort to know exactly what was on the actual page seen in a session. The vast amount of data gathered by Web logs introduces another problem: scalability. Assembling the data, transforming it, and loading it into a database are best performed using parallel hardware and software.

    Even if you succeed in identifying customers, sessions, and the contents of viewed pages, you still need to link to other information, such as related sessions, transaction data, customer data, customer service data, and external data from data providers such as Acxiom, Experian, or Polk. In fact, data from all customer touchpoints should ultimately be available for inclusion in the analysis.

    None of these problems is insolvable. However, the solutions require not only a lot of computer horsepower, but a lot of manual effort.

    Permission marketing makes it much easier to identify sessions and customers. By getting permission from customers to allow cookies, typically when customers register, you can leave the information you need on their PCs. In order to succeed with this strategy, you must tell them what the cookies will do and explain why cookies are to their benefit. For example, with the cookie, customers won't need to remember their ID or re-enter their address when ordering something, and you can provide them with customized pages and recommendations. Unfortunately, this only works with people who register or who are willing to accept cookies.

    Minimizing caching problems is more difficult. Pages can be set to expire immediately, but this increases Web traffic and the load on the page servers, which can significantly reduce response time.

    There are some compromises between no caching and caching everything that can reduce traffic but still let you track what's sent. For example, since many pages consist of a mix of graphics such as GIF files and text HTML files, Blue Martini Software Inc.'s E-commerce software allows images to cache and forces only text HTML requests back to the server.
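The compromise can be sketched with standard HTTP response headers. This is an illustration of the general idea only; it is not Blue Martini's implementation, and the function name and the one-day image lifetime are assumptions.

```python
def cache_headers(content_type):
    """Let images cache, but force HTML requests back to the server."""
    if content_type.startswith("image/"):
        # Images may be stored by browsers and proxies for a day (assumed value).
        return {"Cache-Control": "public, max-age=86400"}
    # Everything else is marked as not cacheable so each request is logged.
    return {
        "Cache-Control": "no-cache, no-store, must-revalidate",
        "Pragma": "no-cache",
        "Expires": "0",
    }

image_headers = cache_headers("image/gif")
page_headers = cache_headers("text/html")
```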

    One of the most fruitful avenues for dealing with some of these problems is to take advantage of application servers. Rather than look at just the logs for page requests, you can monitor applications to record events. For example, you can tell what was loaded onto a page or when a page didn't completely load, by using invisible GIFs. These are one-pixel square image files with no content. Their only purpose is to generate an entry on a server that lets the contents and progress of a page be tracked.

    The last step in the process is creating an integrated database of clickstream and other data. This can be a complex database design process. While there are common design elements to all the implementations, the actual database design will reflect the business and products of each company. Many issues will need to be addressed, including deciding on an appropriate logical data structure and aggregating information to appropriate levels (such as customers, dates, or products).

    Creating a database for mining clickstream data is a long and complicated process. The next step is to explore the data. First, we start with simple aggregations and distributions to quantify the following:

  • How many people come to a particular Web site?

  • Which sites refer the most visitors, and which sites refer the most visitors who buy something?

  • How many visitors add something to a market basket?

  • How many complete the purchase, and which searches failed the most?

  • What are the best-selling and worst-selling products?
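Once sessions have been assembled, several of the questions above reduce to simple counts. The sketch below assumes each session has already been summarized as a small tuple; the data and field layout are made up for illustration.

```python
from collections import Counter

# Hypothetical session summaries: (referring site, added to basket?, purchased?)
sessions = [
    ("search.example.com", True, True),
    ("search.example.com", False, False),
    ("portal.example.com", True, False),
    ("search.example.com", True, True),
    ("portal.example.com", False, False),
]

visits = len(sessions)
referrals = Counter(site for site, _, _ in sessions)
buyer_referrals = Counter(site for site, _, bought in sessions if bought)
basket_rate = sum(added for _, added, _ in sessions) / visits
conversion_rate = sum(bought for _, _, bought in sessions) / visits
```

Comparing `referrals` with `buyer_referrals` separates the sites that send the most visitors from the sites that send the most visitors who buy something.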

    Visualizations are a useful way to understand your data. By condensing information into a display, graphics let you quickly see how data is distributed, spot unusual values, or notice possible relationships among variables.

    Data transformation is the last step before building models. For example, in trying to predict who will be likely to respond to an offer, you may need to create new variables that are derived from your data. If you're working with existing customers, then RFM (recency, frequency, monetary) variables can be very good predictors.
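As a minimal sketch of deriving RFM variables for one customer, assume recency is measured in days since the last purchase and that frequency and monetary value are computed over a trailing 90-day window. The function name, the window length, and the purchase history are illustrative assumptions.

```python
from datetime import date, timedelta

def rfm(purchases, today, window_days=90):
    """Derive RFM variables from a non-empty list of (date, amount) purchases."""
    cutoff = today - timedelta(days=window_days)
    recent = [amount for day, amount in purchases if day >= cutoff]
    last_purchase = max(day for day, _ in purchases)
    return {
        "recency_days": (today - last_purchase).days,
        "frequency": len(recent),
        "monetary_total": sum(recent),
        "average_order": sum(recent) / len(recent) if recent else 0.0,
    }

purchases = [
    (date(2000, 6, 1), 80.0),    # outside the 90-day window
    (date(2000, 12, 15), 40.0),
    (date(2001, 2, 10), 60.0),
]
scores = rfm(purchases, today=date(2001, 3, 12))
```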

    Recency may be the number of days since the last purchase. Frequency might be the number of purchases in the last three months. And monetary might be the total purchases in the last three months as well as the average order size over that period.

    Many E-commerce applications make product recommendations to customers based on previous purchases, the item being viewed, or the contents of a shopping cart via methods such as collaborative filtering or association discovery.

    Since these methods typically don't involve the testing phase of true predictive models, they'll generally be less accurate. However, they require much less information than more precise predictive models in that they're based solely on behaviors at the vendor site. Consequently, they can be used with prospects as well as existing customers. In this case, some accuracy is being sacrificed for a reasonable guess, with little downside risk. The only cost associated with being wrong is the lost opportunity of missing a sale that an accurate prediction might have made.
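A bare-bones illustration of the association idea is to count how often items share a basket and recommend the most frequent companions. This is a simplified sketch with made-up baskets, not a full association-discovery or collaborative-filtering algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets observed at the vendor site.
baskets = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"stove", "fuel"},
    {"tent", "stove", "fuel"},
]

# Count co-occurrences in both directions so lookups by either item work.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def recommend(item, top_n=2):
    """Rank other items by how often they shared a basket with `item`."""
    scores = Counter({other: n for (first, other), n in pair_counts.items()
                      if first == item})
    return [other for other, _ in scores.most_common(top_n)]

fuel_suggestion = recommend("fuel", top_n=1)
tent_suggestions = recommend("tent")
```

Nothing here is tested against held-out data, which is the trade-off described above: the guess is cheap and needs only site behavior, but it comes with no measured accuracy.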

    For site visitors whose identity is known, information about their characteristics and preferences can be factored into predictive models, resulting in more customized predictions. For example, males in one geographic location who placed a particular item in their market basket might receive a different recommendation than females in the same geographic location or males in a different location.

    It's important to evaluate models for accuracy and effectiveness. Effectiveness may be measured by such traditional economic metrics as profitability or return on investment. However, these objective measures are useless if the model doesn't make sense. In particular, because of the large number of variables in E-commerce applications and their intrinsic complexity, there are two errors that must be carefully avoided.

    The first is sometimes referred to as a specification search. If you look at enough variables, sooner or later you'll find at least one that correlates well with what you're trying to predict. For example, on Oct. 30, it was noted on Monday Night Football that every time the Washington Redskins had won their last home game before a presidential election, the incumbent party had won. The Redskins lost to the Tennessee Titans that night, and the incumbent party did indeed lose the election.

    It's clear that there's no real linkage between these occurrences, yet a pretty robust predictor was discovered. Before using this relationship to guide your betting in the next presidential election, however, consider how many possible variables were searched before this one was found.
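The effect is easy to reproduce with random numbers. The sketch below generates a target and 200 candidate predictors that are pure noise, then reports the strongest correlation found; the sample sizes and seed are arbitrary assumptions.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
# Ten observations of a target, and 200 unrelated candidate variables.
target = [rng.gauss(0, 1) for _ in range(10)]
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]

# The "best predictor" is simply the luckiest of the 200 noise variables.
best = max(abs(correlation(c, target)) for c in candidates)
```

With so few observations and so many candidates, the best correlation is almost always substantial even though every variable is noise, which is exactly the trap a specification search sets.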

    Similarly misleading results can happen because of lurking variables, in which a variable appears to predict the response variable but in reality does so only through its relationship to a variable that's not being considered. Hair length predicts height, but only because women generally have longer hair but are shorter than men.
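The hair-length example can be made concrete with a small made-up data set in which hair length and height are unrelated within each sex, yet strongly related when the two groups are pooled. All numbers below are invented for illustration.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical (hair length in cm, height in cm) observations.
women = [(30, 160), (40, 160), (30, 170), (40, 170)]
men = [(5, 175), (15, 175), (5, 185), (15, 185)]

def hair_height_correlation(people):
    hair, height = zip(*people)
    return correlation(hair, height)

overall = hair_height_correlation(women + men)
within_women = hair_height_correlation(women)
within_men = hair_height_correlation(men)
```

In this made-up sample the pooled correlation is strongly negative while the correlation inside each group is zero, so the apparent predictive power of hair length comes entirely from the variable left out: sex.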

    It's necessary to carefully interpret models to make sure they are sensible. Remember that predictive models aren't necessarily revealing the true underlying causes of behavior.

    Having built some models, it's now necessary to act on them. In E-commerce, there are two main classes of customer interaction: inbound, in which the customer comes to the site, and outbound, in which the vendor goes to the customer, as in an E-mail promotion.

    Inbound interactions require quick response to the various stages of the transaction. The relevant information, such as the identity of the customer and items in the shopping cart, must quickly be sent from the current transaction to the modeling engine, which determines the correct action and sends it back to the application.

    Outbound interactions are a bit more leisurely. To identify the targets of a campaign solicitation, the model can be applied in batch to the list of prospective recipients.

    Lastly, it's important to close the marketing loop. The actual effectiveness of the models must be compared with the reality, and if necessary the models and data modified as part of a continuous process of improvement.

    Assembling the data-mining database is a challenging task characterized by large data volumes and complex transformations. Hence, it may be necessary to use more specialized warehouse tools that are parallelized, such as those from Ab Initio Software Corp. and Torrent Systems Inc.

    Building models on large databases can be made easier through sampling (selecting a random set of rows), but even samples can still represent large volumes of data. Here, too, there exist parallelized tools, such as those from IBM, Oracle, and Torrent.
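Sampling in the sense used here, selecting a random set of rows, can be sketched in a few lines. The function name, the 10% fraction, and the fixed seed are assumptions for illustration.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Return a simple random sample of the rows, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

rows = list(range(1000))  # stand-in for database rows
sample = sample_rows(rows, 0.1)
```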

    Increasingly, E-commerce companies are seeking solution-oriented analytical applications aimed at their specific business problems. However, even in the context of these apps, all steps described above still must be carried out.

    Analyzing clickstream data for E-commerce is an evolving application. The challenges are real but not insurmountable. Recognize that you'll need to spend most of your time in preparing your database, dealing with the low quality of some of the data, and trying to make sense of your results. There's still a large element of art in successfully using clickstream data.


    Analytics Tools That Can Help

    Company: Accrue, Fremont, Calif., 510-580-4500, www.accrue.com
    Product: Insight 5
    What It Does: Analytic application helps users understand customer preferences and buying habits, and develop targeted promotions to increase profitability and ROI

    Company: Angoss, Toronto, 416-593-1122, www.angoss.com
    Product: Knowledge Webmine
    What It Does: Data-mining application for E-business leverages Web logs and operational data sources

    Company: Blue Martini, San Mateo, Calif., 650-356-4000, www.bluemartini.com
    Product: Blue Martini Marketing
    What It Does: Platform provides customer analysis and marketing automation; lets companies analyze customer behavior for patterns and personalize content

    Company: E.piphany, San Mateo, Calif., 650-356-3800, www.epiphany.com
    Product: E.5
    What It Does: Combines analytic and operational customer-relationship management, manages inbound and outbound customer interactions, and supports personalized marketing and interactive sales

    Company: NetGenesis, Cambridge, Mass., 800-982-6351, www.netgen.com
    Product: NetGenesis 5
    What It Does: Web-analysis platform helps users identify, define, and target customer segments, track online marketing, and define and create metrics to track and analyze Web-site performance

    Company: Net Perceptions, Edina, Minn., 952-842-5000, www.netperceptions.com
    Product: E-Commerce Analyst
    What It Does: Analysis and reporting tool with embedded data mining for analyzing customer behavior and optimizing customer acquisition, retention, promotion, and marketing campaigns

    Company: SAS Institute, Cary, N.C., 919-677-8000, www.sas.com
    Product: Enterprise Miner
    What It Does: Data-mining application for E-commerce helps businesses find trends and predict future outcomes using demographic data and customer buying patterns

    Company: SPSS, Chicago, 312-651-3000, www.spss.com
    Product: Clementine
    What It Does: Data-mining tools help companies create predictive business models using operational data and business intelligence, identify customer interaction sequences, and drive decision-making

    Company: Vignette, Austin, Texas, 512-741-4300, www.vignette.com
    Product: Relationship Management Server
    What It Does: Data-mining tool helps with customer profiling, closed-loop online marketing, and delivery of targeted content

    Company: WebTrends, Portland, Ore., 503-294-7025, www.webtrends.com
    Product: Commerce Trends
    What It Does: Platform for visitor-relationship management helps businesses understand customer behavior using Web and transactional data to drive personalization and target marketing

    Herbert A. Edelstein is president of Two Crows, a consulting firm that specializes in data mining and business intelligence. You can reach him at herb@twocrows.com.
