March 12, 2001
Pan For Gold In The Clickstream
continued...page 2 of 2
By Herbert A. Edelstein (herb@twocrows.com)
The next problem is that many pages are dynamically created, making it hard to know their content. For example, recommendation pages are assembled when the customer visits the vendor's site, so it may take some effort to know exactly what was on the actual page seen in a session. The vast amount of data gathered by Web logs introduces another problem: scalability. Assembling the data, transforming it, and loading it into a database are best performed using parallel hardware and software.
Even if you succeed in identifying customers, sessions, and the contents of viewed pages, you still need to link to other information, such as related sessions, transaction data, customer data, customer service data, and external data from data providers such as Acxiom, Experian, or Polk. In fact, data from all customer touchpoints should ultimately be available for inclusion in the analysis.
None of these problems is insolvable. However, the solutions require not only a lot of computer horsepower but also a lot of manual effort.
Permission marketing makes it much easier to identify sessions and customers. By getting permission from customers to allow cookies, typically when customers register, you can leave the information you need on their PCs. In order to succeed with this strategy, you must tell them what the cookies will do and explain why cookies are to their benefit. For example, with the cookie, customers won't need to remember their ID or re-enter their address when ordering something, and you can provide them with customized pages and recommendations. Unfortunately, this only works with people who register or who are willing to accept cookies.
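Once a cookie ID is available, grouping page requests into sessions becomes mostly mechanical. The following is a minimal sketch, not a description of any vendor's method: it assumes each hit is a (cookie_id, timestamp, url) tuple and that a session ends after 30 minutes of inactivity. The function name, the tuple layout, and the 30-minute gap are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def sessionize(hits, gap=timedelta(minutes=30)):
    """Group (cookie_id, timestamp, url) hits into sessions per cookie.

    A new session starts when a cookie has been idle longer than `gap`
    (an assumed timeout; tune it to the site).
    Returns {cookie_id: [session, ...]}, each session a list of (timestamp, url).
    """
    sessions = {}
    # Sort by cookie, then by time, so each visitor's hits arrive in order.
    for cookie_id, ts, url in sorted(hits, key=lambda h: (h[0], h[1])):
        visitor = sessions.setdefault(cookie_id, [])
        # Start a new session on the first hit or after a long idle period.
        if not visitor or ts - visitor[-1][-1][0] > gap:
            visitor.append([])
        visitor[-1].append((ts, url))
    return sessions
```

Visitors who refuse cookies never appear under a stable ID here, which is the limitation noted above.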
Minimizing caching problems is more difficult. Pages can be set to expire immediately, but this increases Web traffic and the load on the page servers, which can significantly slow response time.
There are some compromises between no caching and caching everything that can reduce traffic but still let you track what's sent. For example, since many pages consist of a mix of graphics such as GIF files and text HTML files, Blue Martini Software Inc.'s E-commerce software allows images to cache and forces only text HTML requests back to the server.
One of the most fruitful avenues for dealing with some of these problems is to take advantage of application servers. Rather than look at just the logs for page requests, you can monitor applications to record events. For example, you can tell what was loaded onto a page, or when a page didn't completely load, by using invisible GIFs. These are one-pixel-square image files with no content. Their only purpose is to generate an entry on a server that lets the contents and progress of a page be tracked.
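As a rough illustration of the invisible-GIF technique, the sketch below builds a one-pixel image tag whose URL carries a session ID, a page name, and the name of a page component, then recovers those fields from the request path as it would appear in a server log. The path /t.gif, the parameter names (sid, page, part), and both function names are hypothetical and not taken from any particular product.

```python
import html
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical path of the one-pixel image on the tracking server.
PIXEL_PATH = "/t.gif"

def pixel_tag(session_id, page, component):
    """Return an <img> tag for a 1x1 GIF whose URL encodes what was shown."""
    query = urlencode({"sid": session_id, "page": page, "part": component})
    # Escape the URL so the ampersands are valid inside an HTML attribute.
    src = html.escape(f"{PIXEL_PATH}?{query}")
    return f'<img src="{src}" width="1" height="1" alt="">'

def parse_pixel_request(request_path):
    """Recover the tracked fields from a logged request path.

    Returns None when the request is not for the tracking pixel.
    """
    parts = urlsplit(request_path)
    if parts.path != PIXEL_PATH:
        return None
    fields = parse_qs(parts.query)
    # parse_qs returns a list per key; keep the first value of each.
    return {name: values[0] for name, values in fields.items()}
```

Placing such a tag at the very end of a page means that a logged request for it suggests the page loaded at least that far; a missing entry suggests the load was cut short.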
The last step in the process is creating an integrated database of clickstream and other data. This can be a complex database design process. While there are common design elements to all the implementations, the actual database design will reflect the business and products of each company. Many issues will need to be addressed, including deciding on an appropriate logical data structure and aggregating information to appropriate levels (such as customers, dates, or products).
Creating a database for mining clickstream data is a long and complicated process. The next step is to explore the data. First, we start with simple aggregations and distributions to quantify the following:
Visualizations are a useful way to understand your data. By condensing information into a display, graphics let you quickly see how data is distributed, spot unusual values, or notice possible relationships among variables.
Data transformation is the last step before building models. For example, in trying to predict who will be likely to respond to an offer, you may need to create new variables that are derived from your data. If you're working with existing customers, then RFM (recency, frequency, monetary) variables can be very good predictors.
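A minimal sketch of deriving RFM variables from order records follows. It assumes recency is the number of days since the last purchase and that frequency and monetary value are measured over a trailing window of 90 days, used here as a stand-in for three months. The function name, the (customer_id, order_date, amount) tuple layout, and the output field names are illustrative assumptions.

```python
from datetime import date, timedelta

def rfm(orders, today, window_days=90):
    """Compute RFM variables per customer.

    orders: iterable of (customer_id, order_date, amount) tuples.
    Returns {customer_id: {"recency_days", "frequency", "monetary", "avg_order"}}.
    """
    window_start = today - timedelta(days=window_days)
    result = {}
    for customer, order_date, amount in orders:
        row = result.setdefault(
            customer, {"last": order_date, "frequency": 0, "monetary": 0.0}
        )
        # Recency looks at all orders, not just those inside the window.
        row["last"] = max(row["last"], order_date)
        if order_date >= window_start:
            row["frequency"] += 1
            row["monetary"] += amount
    for row in result.values():
        row["recency_days"] = (today - row.pop("last")).days
        # Average order size over the window; zero when nothing was bought.
        row["avg_order"] = row["monetary"] / row["frequency"] if row["frequency"] else 0.0
    return result
```

The resulting values can be appended to each customer record as candidate predictors before a response model is built.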
Recency may be the number of days since the last purchase. Frequency might be the number of purchases in the last three months. And monetary might be the total purchases in the last three months as well as the average order size over that period.
Many E-commerce applications make product recommendations to customers based on previous purchases, the item being viewed, or the contents of a shopping cart via methods such as collaborative filtering or association discovery.
Since these methods typically don't involve the testing phase of true predictive models, they'll generally be less accurate. However, they require much less information than more precise predictive models in that they're based solely on behaviors at the vendor site. Consequently, they can be used with prospects as well as existing customers. In this case, some accuracy is being sacrificed for a reasonable guess, with little downside risk. The only cost associated with being wrong is the lost opportunity of missing a sale that an accurate prediction might have made.
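To make the idea concrete, here is a deliberately simple sketch based on counting which items appear together in past baskets. It is a simplified stand-in for association discovery, not a full association-rule or collaborative-filtering algorithm, and it uses only on-site behavior, as described above. The function names and the basket format (a list of item names) are assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(baskets):
    """Count, for each item, how often every other item shares a basket with it."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() ignores repeated items within a single basket.
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts

def recommend(counts, cart, top_n=3):
    """Suggest the items most often bought together with the cart's contents."""
    scores = Counter()
    for item in cart:
        scores.update(counts.get(item, {}))
    # Never recommend something that is already in the cart.
    for item in cart:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(top_n)]
```

Because nothing here is validated against held-out data, the suggestions are reasonable guesses rather than tested predictions, which matches the trade-off described above.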
For site visitors whose identity is known, information about their characteristics and preferences can be factored into predictive models, resulting in more customized predictions. For example, males in one geographic location who placed a particular item in their market basket might receive a different recommendation than females in the same geographic location or males in a different location.
It's important to evaluate models for accuracy and effectiveness. Effectiveness may be measured by such traditional economic metrics as profitability or return on investment. However, these objective measures are useless if the model doesn't make sense. In particular, because of the large number of variables in E-commerce applications and their intrinsic complexity, there are two errors that must be carefully avoided.
The first is sometimes referred to as a specification search. If you look at enough variables, sooner or later you'll find at least one that correlates well with what you're trying to predict. For example, on Oct. 30, it was noted on Monday Night Football that whenever the Washington Redskins have won their last home game before a presidential election, the incumbent party has won. The Redskins lost to the Tennessee Titans that night, and the incumbent party did indeed lose the election.
It's clear that there's no real linkage between these occurrences, yet a pretty robust predictor was discovered. Before using this relationship to guide your betting in the next presidential election, however, consider how many possible variables were searched before this one was found.
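A small simulation can show how easily a search of this kind turns up a spurious predictor. The sketch below generates a random yes/no outcome series and then many unrelated random series, and reports the best agreement found by chance alone. The function name and the choice of coin-flip data are assumptions made for illustration.

```python
import random

def best_chance_match(n_candidates, n_events, seed=0):
    """Return the best agreement rate between a random outcome series
    and any of n_candidates unrelated random binary series."""
    rng = random.Random(seed)
    outcome = [rng.randint(0, 1) for _ in range(n_events)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is pure noise with no link to the outcome.
        candidate = [rng.randint(0, 1) for _ in range(n_events)]
        agreement = sum(c == o for c, o in zip(candidate, outcome)) / n_events
        best = max(best, agreement)
    return best
```

With only a handful of events and thousands of candidate variables, the best match is typically close to perfect even though every candidate is noise, which is why the number of variables searched matters when judging a predictor.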
Similarly misleading results can happen because of lurking variables, in which a variable appears to predict the response variable but in reality does so only through its relationship to a variable that's not being considered. Hair length predicts height, but only because women generally have longer hair and are shorter than men; the lurking variable is sex.
It's necessary to carefully interpret models to make sure they are sensible. Remember that predictive models aren't necessarily revealing the true underlying causes of behavior.
Having built some models, it's now necessary to act on them. In E-commerce, there are two main classes of customer interaction: inbound, in which the customer comes to the site, and outbound, in which the vendor goes to the customer, as in an E-mail promotion.
Inbound interactions require quick response to the various stages of the transaction. The relevant information, such as the identity of the customer and items in the shopping cart, must quickly be sent from the current transaction to the modeling engine, which determines the correct action and sends it back to the application.
Outbound interactions are a bit more leisurely. To identify the targets of a campaign solicitation, the model can be applied in batch to the list of prospective recipients.
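The difference between the two modes can be sketched as one scoring function called in two ways: once per live transaction for inbound interactions, and over a whole list for outbound campaigns. Everything here is hypothetical, including the function names, the customer fields, and the hand-set scoring rule, which merely stands in for a trained model.

```python
def score(customer):
    """Toy stand-in for a trained model: higher means more likely to respond."""
    return 0.1 * customer["orders_90d"] - 0.01 * customer["days_since_last_order"]

def inbound_action(customer, cart, threshold=0.0):
    """Score one live transaction and return an action for the application."""
    if cart and score(customer) > threshold:
        return "show_offer"
    return "no_offer"

def outbound_targets(customers, top_n):
    """Score a whole prospect list in batch and keep the best top_n IDs."""
    ranked = sorted(customers, key=score, reverse=True)
    return [c["id"] for c in ranked[:top_n]]
```

In practice the inbound path has to return within the time budget of a page request, while the batch path can run offline against the full recipient list.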
Lastly, it's important to close the marketing loop. The actual effectiveness of the models must be compared with reality, and, if necessary, the models and data modified as part of a continuous process of improvement.
Assembling the data-mining database is a challenging task characterized by large data volumes and complex transformations. Hence, it may be necessary to use more specialized warehouse tools that are parallelized, such as those from Ab Initio Software Corp. and Torrent Systems Inc.
Building models on large databases can be made easier through sampling (selecting a random set of rows), but even samples can still represent large volumes of data. Here, too, there exist parallelized tools, such as those from IBM, Oracle, and Torrent.
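When the rows come from a table too large to hold in memory, one way to draw a random set of rows in a single pass is reservoir sampling. This is a technique chosen here for illustration, not one named in the article, and the function name and parameters are assumptions.

```python
import random

def sample_rows(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length.

    Uses reservoir sampling, so the full data set is never held in memory.
    If fewer than k rows exist, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

The rows argument can be any iterable, such as a database cursor or a file being read line by line.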
Increasingly, E-commerce companies are seeking solution-oriented analytical applications aimed at their specific business problems. However, even in the context of these apps, all steps described above still must be carried out.
Analyzing clickstream data for E-commerce is an evolving application. The challenges are real but not insurmountable. Recognize that you'll need to spend most of your time in preparing your database, dealing with the low quality of some of the data, and trying to make sense of your results. There's still a large element of art in successfully using clickstream data.
Accrue: Insight 5. Analytic application helps users understand customer preferences and buying habits, and develop targeted promotions to increase profitability and ROI.
Angoss (Toronto, 416-593-1122, www.angoss.com): Knowledge Webmine. Data-mining application for E-business leverages Web logs and operational data sources.
Blue Martini: Blue Martini Marketing. Platform provides customer analysis and marketing automation; lets companies analyze customer behavior for patterns and personalize content.
E.piphany: E.5. Combines analytic and operational customer-relationship management, manages inbound and outbound customer interactions, and supports personalized marketing and interactive sales.
NetGenesis: NetGenesis 5. Web-analysis platform helps users identify, define, and target customer segments, track online marketing, and define and create metrics to track and analyze Web-site performance.
Net Perceptions: E-Commerce Analyst. Analysis and reporting tool with embedded data mining for analyzing customer behavior and optimizing customer acquisition, retention, promotion, and marketing campaigns.
SAS Institute: Enterprise Miner. Data-mining application for E-commerce helps businesses find trends and predict future outcomes using demographic data and customer buying patterns.
SPSS: Clementine. Data-mining tools help companies create predictive business models using operational data and business intelligence, identify customer interaction sequences, and drive decision-making.
Vignette: Relationship Management Server. Data-mining tool helps with customer profiling, closed-loop online marketing, and delivery of targeted content.
WebTrends: Commerce. Platform for visitor-relationship management helps businesses understand customer behavior using Web and transactional data to drive personalization and target marketing.
Herbert A. Edelstein is president of Two Crows, a consulting firm that specializes in data mining and business intelligence. You can reach him at herb@twocrows.com.