March 12, 2001
Pan For Gold In The Clickstream
Companies face many challenges as they extend the functionality of their data-mining systems to analyze E-commerce data. It's worth the effort, though, because the business benefits of customer intelligence can be enormous.
By Herbert A. Edelstein (herb@twocrows.com)
Companies venturing into E-commerce have a dream. By analyzing the tracks people make through their Web site, they'll be able to optimize its design to maximize sales. Information about customers and their purchasing habits will let companies initiate E-mail campaigns and other activities that result in sales. Good models of customers' preferences, needs, desires, and behaviors will let companies simulate the personal relationship that businesses and their clientele had in the good old days.

The foundation of this dream is the log of customer accesses maintained by Web servers. A sequence of page hits might look something like this: Page A => Page B => Page C => Page D => Page C => Page B => Page F => Page G. Or more explicitly: Login => Register => Product Description => Purchase.
By analyzing customer paths through the data, vendors hope to personalize the interactions that customers and prospects have with them. Companies will customize the home page each customer sees, the responses to requests, and the recommendations of items to purchase. If you're a customer of Amazon.com Inc., you may already have noticed such personalization. Vendors can also generate a list of related products.
The business benefits of this customer intelligence are potentially enormous. The number of people who come to a site and purchase will increase, and the average amount per purchase will rise, resulting in a dramatic increase in profitability--that's the dream, at least.
The reality is that achieving this goal is difficult and expensive--but it's not impossible. First, to be of any use at all, clickstream data requires enormous amounts of labor-intensive pre-processing. Even then, extracting meaning is still difficult. Second, many customers are reluctant to have vendors track what they do. Their concern is so great that the government is actively considering privacy regulation to limit Web tracking.
To understand the special challenges of clickstream analysis, let's examine the issues in the context of the data-mining process. Data mining uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make accurate predictions. It's through data mining that companies can build the most effective models of their customers and prospects.
The seven steps in this process are:
When defining the business problem, you must be specific about what you want to accomplish. Typical goals might include improving the design of a Web site by identifying the paths people take to arrive at a purchase; detecting problems such as pages that are never accessed; and suggesting strategies for increasing market basket size or the conversion rate (turning visitors into purchasers).
Building the data-mining database, exploring the data, and preparing it for modeling are the most time-consuming steps. For clickstream data, these tasks are particularly arduous, consuming 80% to 95% of a project's time and resources.
These are the key steps in building a data-mining database:
First, the raw clickstream data from Web-server access logs must be consolidated from multiple servers and turned into usable records. This is difficult because the pages visited by one individual are usually randomly buried in a mass of other pages, perhaps separated by hundreds or thousands of references to other pages. The common log format files or extended common log files are typically used by the Web server to keep track of the requests that occur at a Web site. The fields available in extended common log files include: host, date/time, request, status, bytes, referring page, and browser type. The actual record would look something like this:
151.196.31.32 - - [11/Oct/2000:00:10:14 -0400] "GET /publictn.htm HTTP/1.0" 200 5896 "http://www.twocrows.com/index.htm" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
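As a rough illustration of turning such a record into usable fields, here is a minimal Python sketch. It assumes the log follows the standard extended (combined) layout with single spaces between fields; the pattern name, the function name, and the derived "path" field are illustrative choices, not part of any particular product.

```python
import re
from datetime import datetime

# Assumed layout: host, identity, user, [timestamp], "request",
# status, bytes, "referring page", "browser type".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    # The request field holds "METHOD path PROTOCOL"; keep the path separately.
    parts = record["request"].split()
    record["path"] = parts[1] if len(parts) >= 2 else ""
    return record

sample = ('151.196.31.32 - - [11/Oct/2000:00:10:14 -0400] '
          '"GET /publictn.htm HTTP/1.0" 200 5896 '
          '"http://www.twocrows.com/index.htm" '
          '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"')
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["referrer"])
```

Lines that do not fit the pattern come back as None, so they can be counted and inspected rather than silently dropped.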
Further complicating the process is the fact that a single page request may generate multiple entries on different logs for different server types. There may be page servers, image servers, ad servers, application servers, or other types of servers, each logging actions, sometimes in its own native format. And a logical server may in fact be multiple physical servers for geographic and load-balancing reasons.
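One way to picture the consolidation step is as a merge of several time-ordered logs into a single stream. The Python sketch below assumes each server's log has already been parsed into (timestamp, server, path) tuples, that each log is sorted by time, and that the servers' clocks agree; the server names and paths are made up for the example.

```python
import heapq
from datetime import datetime

def merge_logs(*logs):
    """Merge several time-ordered lists of (timestamp, server, path) records."""
    return list(heapq.merge(*logs, key=lambda record: record[0]))

# Hypothetical records from two server types handling one page request.
page_log = [
    (datetime(2000, 10, 11, 0, 10, 14), "page", "/index.htm"),
    (datetime(2000, 10, 11, 0, 10, 40), "page", "/publictn.htm"),
]
image_log = [
    (datetime(2000, 10, 11, 0, 10, 15), "image", "/images/logo.gif"),
]

merged = merge_logs(page_log, image_log)
for timestamp, server, path in merged:
    print(timestamp.isoformat(), server, path)
```

In practice, clock differences between physical servers and differing native log formats would have to be reconciled before a merge like this is meaningful.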
Our goal is to take a sequence of log records like the one above and create a session of page views like this:
A => B => C => D => C => B => F.
Unfortunately, we're starting from a collection of raw log entries that look more like this:
A, q, w, B, t, y, x, D, z, @, $, C, u, I, p, B, f, F, #, %, ^, G, k, l, v, &, =, }, t, r, b, l.
The first step in cleaning up the log is to remove the extraneous pages (such as q, w, and @), which may be requests for image files or page requests from Web spiders gathering information for search engines or other non-user requests.
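A simple filter along these lines might look like the following Python sketch. The list of file extensions and the robot markers in the browser-type string are assumptions for illustration; a real site would tune both to its own content and traffic.

```python
import os

# Assumed lists, for illustration only.
NON_PAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_page_view(path, agent):
    """Heuristically decide whether a request is a person viewing a page."""
    # Strip any query string before inspecting the file extension.
    extension = os.path.splitext(path.split("?", 1)[0])[1].lower()
    if extension in NON_PAGE_EXTENSIONS:
        return False
    agent_lower = agent.lower()
    if any(marker in agent_lower for marker in ROBOT_MARKERS):
        return False
    return True

requests = [
    ("/index.htm", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
    ("/images/logo.gif", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
    ("/publictn.htm", "ExampleSpider/1.0"),  # hypothetical spider name
]
page_views = [path for path, agent in requests if is_page_view(path, agent)]
print(page_views)
```

Spiders that do not identify themselves in the browser-type field will slip through a filter like this, so it is only a first pass.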
Next you have to identify the sessions contained in the data stream. This is much harder than it sounds, and is still a subject of research. The problem stems from the fact that the Web was designed to be stateless; that is, each request to a server starts as if nothing happened previously. Consequently, there's no intrinsic way to identify sessions, even if you know the user's identification.
There are three primary ways to finesse this problem and identify sessions from Web access log data. The first approach is to use heuristics. IP addresses aren't enough to identify a customer because they're not unique to that person. Frequently, an IP address is assigned from a pool of addresses by an Internet service provider. It's sometimes said that half of the United States lives in Vienna, Va., because that's the home of America Online. To identify a session, you can try a combination of IP address, browser type, and pages viewed.
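The heuristic approach can be sketched in Python as grouping requests by a composite key. This example uses only the IP address and browser type; it assumes records have already been parsed into dictionaries with "host", "agent", and "path" fields (names chosen for this example), and it leaves out the pages-viewed part of the heuristic.

```python
from collections import defaultdict

def group_by_visitor(records):
    """Group page paths by an (IP address, browser type) key, in log order."""
    visitors = defaultdict(list)
    for record in records:
        key = (record["host"], record["agent"])
        visitors[key].append(record["path"])
    return dict(visitors)

# Two hypothetical visitors sharing one IP address from an ISP's pool.
records = [
    {"host": "151.196.31.32", "agent": "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)", "path": "/index.htm"},
    {"host": "151.196.31.32", "agent": "Mozilla/4.7 [en] (Win98; I)", "path": "/index.htm"},
    {"host": "151.196.31.32", "agent": "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)", "path": "/publictn.htm"},
]
visitors = group_by_visitor(records)
for (host, agent), pages in visitors.items():
    print(host, agent, pages)
```

Two people behind the same address with the same browser would still be merged, which is why this remains a heuristic rather than a reliable identifier.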
A second approach is to embed session identification numbers in the URL. This works well as long as the customer doesn't visit another site during the session. If that happens, the session ID is lost upon return and the customer will appear as a new customer.
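To make the idea concrete, here is a small Python sketch that appends a session ID to a link and reads it back from a logged request. The parameter name "sid" and both function names are hypothetical; sites that use this approach pick their own conventions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

SESSION_PARAM = "sid"  # hypothetical query-parameter name

def add_session_id(url, session_id):
    """Return the URL with the session ID added to its query string."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query[SESSION_PARAM] = [session_id]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

def extract_session_id(url):
    """Return the session ID carried in the URL, or None if there is none."""
    values = parse_qs(urlsplit(url).query).get(SESSION_PARAM)
    return values[0] if values else None

tagged = add_session_id("/catalog.htm?item=42", "abc123")
print(tagged)
print(extract_session_id(tagged))
print(extract_session_id("/index.htm"))  # a visitor returning via a plain link
```

The last call shows the weakness described above: a request that arrives without the parameter carries no session ID, so the visitor looks new.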
But by far the most common approach is to use cookies. A cookie is a text file placed on your computer that contains information about your session and what you did. Many customers don't like cookies, so they refuse to accept them or accept them only selectively. These surfers worry about being tracked or about having mysterious files residing in their computers.
There are problems that remain even if you successfully identify a visitor to your site.
You may have noticed in the hypothetical raw log string shown earlier that Page C is missing in the step from Page B to Page D. That's because not all the accesses to a Web site's pages are made on the Web site's servers. Requests for a page may be filled by a different server called the proxy server that sits between the customer and the home server on which the page resides. If a request to your site is filled by a proxy server, you have no information about it. There are estimates that over half of all page requests are filled by proxy servers.
There are two main types of proxy servers: cache servers and filter servers. Most people have used some kind of cache server, in which pages are stored closer to the customer for improved performance. For example, when a customer hits the "back" button on a browser, the previous page may be retrieved from a local cache; or when he or she first accesses a popular page, it's actually being retrieved from a server at the ISP.
One challenge is identifying the end of a session and the reason for its end. If a session culminates in a purchase transaction, that's relatively easy to identify. But what if it doesn't? The most common heuristic used to identify the end of a session is to set a fixed time after the last page access at a site, typically 30 minutes. But that doesn't tell you why the session ended. It may have ended because a page took too long to load and the customer went someplace else in frustration, or the customer had to answer the phone and returned to the session 31 minutes later. Clearly, these reasons have differing implications for the vendor.
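The fixed-timeout heuristic can be sketched in a few lines of Python. The example assumes one visitor's hits have already been isolated and sorted by time; the function name and the sample data are illustrative.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_sessions(hits, timeout=SESSION_TIMEOUT):
    """Split one visitor's time-ordered (timestamp, page) hits into sessions."""
    sessions = []
    last_time = None
    for timestamp, page in hits:
        # Start a new session on the first hit or after a gap longer than the timeout.
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions

start = datetime(2000, 10, 11, 0, 10, 14)
hits = [
    (start, "A"),
    (start + timedelta(minutes=2), "B"),
    (start + timedelta(minutes=33), "C"),  # 31 minutes after B
]
sessions = split_sessions(hits)
print(sessions)
```

Here the customer who returned 31 minutes later is counted as starting a second session, which shows how the cutoff records that a session ended without explaining why.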











