The Operations Phase
The operations phase is where the rubber meets the road. At this point, you have the best possible model (given time, data and technology constraints) and business approval to proceed. The operations phase involves three main tasks: implementation, impact assessment and maintenance.
At one end of the spectrum, a customer-profiling data mining model that runs once a quarter may only involve the data miner and ETL developer. At the other end of the spectrum, making online recommendations will require applications developers and production systems folks to be involved, which is usually a big deal. If you're working on a big-deal project, include these people as early as possible—preferably during the business phase—so they can help determine appropriate timeframes and resources. It's best to roll out the data mining model in phases, starting with a test version, to make sure the data mining server doesn't degrade transaction processing.
Assessing the impact of the data mining model can be high art. In some fields, such as direct mail, the process of tuning and testing marketing offers, collateral and target prospect lists is full-time work for a large team. These teams perform tests on small subsets before they send mass mailings. Even in full campaigns, there are often several phases with different versions and control sets built in. The results of each phase help teams tweak subsequent phases for improved returns. Adopt as much of this careful assessment approach as possible.
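The test-versus-control comparison described above boils down to a simple calculation: compare the response rate of the model-selected list against a randomly selected control set. A minimal sketch, with hypothetical group sizes and response counts chosen purely for illustration:

```python
# Minimal sketch of a test-vs-control impact assessment.
# All counts below are hypothetical illustration values.

def response_rate(responders: int, contacted: int) -> float:
    """Fraction of contacted prospects who responded."""
    return responders / contacted if contacted else 0.0

def lift(test_rate: float, control_rate: float) -> float:
    """How many times better the modeled list performed than the control set."""
    return test_rate / control_rate if control_rate else float("inf")

# Hypothetical phase-one results: model-selected list vs. random control set.
test_rate = response_rate(responders=240, contacted=10_000)     # 2.4%
control_rate = response_rate(responders=80, contacted=10_000)   # 0.8%

print(f"lift = {lift(test_rate, control_rate):.1f}x")  # prints "lift = 3.0x"
```

Each campaign phase yields a new pair of rates; a lift that shrinks toward 1.0 over time is an early signal that the model needs retraining.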
Keep in mind that as the world changes, behaviors and relationships captured in the model become outdated. Almost all data mining models must be retrained or completely rebuilt at some point. A recommendation engine that doesn't include the latest products would be less than useful, for example.
In the best of all worlds, the final data mining model should be documented with a detailed history. A professional data miner will want to know exactly how a model was created in order to explain its value, avoid repeating errors and re-create it if necessary.
Modern data mining tools are so easy to use that it often takes more time to document each iteration than to do the work itself. Nonetheless, you must keep track of what you have and where it came from. Keep a basic set of metadata to track the contents and derivation of all the transformed data sets and resulting mining models you decide to keep. Ideally, your data mining tool will provide the means for tracking these changes, but the simplest approach is to use a spreadsheet.
For every data mining model you keep, your spreadsheet should capture at least the following: model name, version, and date created; training and test data sets; algorithm(s), parameter settings, input and predicted variables used; and results. Your spreadsheet should also track the definitions of the input data sets, the data sources they came from and the ETL modules that created them.
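If a spreadsheet feels too manual, the same tracking can be automated with a few lines of script. The sketch below appends one record per kept model to a CSV file, mirroring the columns listed above; the file name, field names, and example values are hypothetical, not part of any particular tool.

```python
import csv

# One metadata record per kept model, mirroring the spreadsheet columns above.
FIELDS = ["model_name", "version", "date_created", "training_set", "test_set",
          "algorithm", "parameters", "inputs", "predicted", "results"]

# Hypothetical example values for a quarterly customer-profiling model.
record = {
    "model_name": "customer_profile",
    "version": "1.2",
    "date_created": "2006-03-15",
    "training_set": "cust_train_q1",
    "test_set": "cust_test_q1",
    "algorithm": "decision_tree",
    "parameters": "complexity=0.5; min_leaf=10",
    "inputs": "tenure, region, total_spend",
    "predicted": "churn_flag",
    "results": "lift=2.1 at top decile",
}

with open("model_metadata.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if f.tell() == 0:          # new file: write the header row first
        writer.writeheader()
    writer.writerow(record)
```

The input-data-set definitions, source systems, and ETL modules can be tracked the same way in a second file keyed by data set name.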
This approach will help you successfully integrate data mining into your DW/BI system. Remember, the easiest path to success begins with understanding business requirements and ends with delivering business value.
Quick Study: Kimball University DW/BI Best Practices
Data mining is becoming more effective, more available and less costly. This three-phased, business-driven approach can help you successfully incorporate data mining into your DW/BI environment.
The design and testing iterations of the various mining models should be tracked in a metadata structure, even if it's a simple spreadsheet.
The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset by J. Mundy and W. Thornthwaite (Wiley, 2006).
Data Mining Techniques, Second Edition by M. Berry and G. Linoff (Wiley, 2004).
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Warren Thornthwaite is a member of the Kimball Group. He cowrote The Data Warehouse Lifecycle Toolkit (Wiley, 1998). This column is excerpted from The Microsoft Data Warehouse Toolkit (Wiley, 2006). Write to him at [email protected].