Data mining as an analytic discipline is neither as obvious as query and reporting nor as carefully positioned, marketing-wise, as online analytic processing (OLAP). It encompasses several disparate techniques, and, although it has achieved noteworthy results in applications such as credit assessment, risk management, and market segmentation, it isn't tightly linked to any particular business domain or function. Unlike OLAP's slice-and-dice style of analyzing data cubes, which isn't hard to grasp or master, data mining has long been enveloped in an aura of abstruse, academic inaccessibility. Overall, data mining means and delivers different things to different people. Whatever it means, it's still daunting to many.
Nonetheless, based on its successes and promise, data mining has become a must-have element of enterprise analytic toolkits. There remain many obstacles, however, to broad inclusion of data mining in everyday, operational business intelligence (BI).
Embracing Data Mining
In their quest for profitability and similar business goals, organizations will pursue any angle that can deliver significant ROI by reducing costs, expanding capabilities, or boosting revenues. The basic premise is easy to grasp: Significant untapped value exists in data assets, value that can be unlocked through a "knowledge discovery" process that teases out patterns impossible to detect through conventional, deterministic (nonstatistical) approaches.
Our embrace of data mining implies that we think business analysts' preconceived notions about the factors that drive business are, on their own, incomplete and limiting. These analysts' notions are encapsulated in dimensional models that presuppose which elements are important without firm mathematical justification for the modeling choices. Forensic analysis (studying historic data and seeking to explain and react to what you find) is also insufficient: Statistically rooted predictive capabilities support better decision-making. Our embrace of data mining also implies that, while interactive ("online") data exploration is useful, automated, embeddable analytics are the future.
Into the Mix
Simply tossing new technology into the enterprise mix isn't an optimal approach. Give careful thought to a variety of approaches that weave technology into an organization's fabric. The key goal should be to offer appropriate advanced methods to both business analysts and business-analytics consumers responsible for everyday business operations.
In many business contexts, patterns identified via data mining can be expressed in dimensional terms. These generated dimensions may yield strong predictions and yet be incomprehensible to business analysts. As statistical constructs, they may have little explanatory power. So business analysts are told that black-box wizardry will create a better picture of aspects of the business than they can. The analysts join executives and line managers as analytics consumers rather than producers.
Producer-consumer differentiation is more significant in data mining, given algorithmic complexities, than it is for mainstream BI technologies. On one end, you have developers and data junkies who build the models and, on the other, people who use the query interfaces, spreadsheets, and dashboards that report model results. Try as the vendors might, typical analytics consumers have neither the training nor the inclination to understand the why of neural nets so long as the what of them is satisfying, except in narrowly defined business domains. In those select domains, where standard data mining algorithms are applied to well-understood business problems such as market-basket analysis, vendors are finding ways to build intelligence into user-facing systems, thus reducing or eliminating the geek factor in building customer models.
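For readers who haven't met market-basket analysis, here is a minimal sketch of the idea in Python. It counts which pairs of items are bought together and reports two standard measures, support and confidence. The baskets, the `pair_rules` name, and the `min_count` threshold are all made up for illustration; this is not any vendor's algorithm, and real products use far more scalable methods.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for this example.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "chips"},
]


def pair_rules(baskets, min_count=2):
    """Return (lhs, rhs, support, confidence) for item pairs seen often enough."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # too rare to be worth reporting
        support = count / n  # share of all baskets containing both items
        for lhs, rhs in ((a, b), (b, a)):
            # Of the baskets containing lhs, the share that also contain rhs.
            confidence = count / item_counts[lhs]
            rules.append((lhs, rhs, support, confidence))
    return rules


rules = pair_rules(BASKETS)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

The output is a list of "people who buy X also buy Y" rules, which is the kind of result a user-facing system can present without exposing the mechanics behind it.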
More general problems still require dedicated data miners. As IBM BI director Anant Jhingran put it, "Modeling is somewhat of an arcane exercise." At least the modeler's job is getting a bit easier as a result of:
- Improvements in automated model identification where the software scans the data to determine appropriate algorithms to apply
- Efforts (described to me by Jhingran) to reduce the number of modeling parameters required and provide advanced model visualization
- Better graphical design workbenches such as SPSS's Clementine and SAS's Enterprise Miner.
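To make the first item above concrete, here is a deliberately tiny sketch of automated model identification: fit several candidate models, score each on data held back from training, and keep the one with the lowest error. The two candidates, the data, and every function name are assumptions chosen for illustration. Commercial workbenches consider many more algorithms and use more careful validation.

```python
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate: one-variable least-squares line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Hypothetical candidate list; a real tool would register many more.
CANDIDATES = {"mean_baseline": fit_mean, "linear": fit_line}


def holdout_error(fit, xs, ys, n_train):
    """Train on the first n_train rows, return mean squared error on the rest."""
    model = fit(xs[:n_train], ys[:n_train])
    errors = [(model(x) - y) ** 2 for x, y in zip(xs[n_train:], ys[n_train:])]
    return statistics.fmean(errors)


def pick_model(xs, ys, n_train):
    """Score every candidate on held-out rows and return the best name."""
    scores = {
        name: holdout_error(fit, xs, ys, n_train)
        for name, fit in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


# Made-up data with a roughly linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
best_name, scores = pick_model(xs, ys, n_train=6)
print(best_name, scores)
```

The modeler still has to judge whether the winning model makes business sense, but the mechanical comparison is the part the software can take over.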
Such workbenches are based on well-thought-out methodologies: SPSS, NCR, and a few others rely on the cross-industry standard process for data mining (CRISP-DM), while SAS created and uses the sample, explore, modify, model, and assess (SEMMA) approach. I wonder, however, whether these methodologies reinforce the isolation of data mining processes from other analytic functions. These modeled analytic processes aren't holistic business processes; that is, they don't reflect the complex operating environments that put analyses in context and force them to be justified and aligned to business goals. And they certainly don't interact with competing trends toward model-driven mainstream analytics.
I've been writing about the process of designing or choosing a data mining algorithm to apply to a problem and building models that encapsulate a chosen approach. These are producer functions. The situation is more favorable in building model execution (scoring) into mainstream analytics to make data mining results available to analytics consumers.
The primary approach to making data mining results accessible to analytics consumers is to extend industry-standard interfaces into the data mining realm. (By results, I'm referring to the models and not to the numbers or predictions they generate.) The leading DBMS vendors are universally seeking to push both the definition of data mining models (direct or via import of Predictive Model Markup Language, or PMML, models) and their execution into the DBMS, where they may be accessed via SQL queries. IBM, with DB2, and Oracle are there already; Microsoft's SQL Server 2005 release, formerly codenamed Yukon and currently slated for delivery in the first half of 2005, will take this same approach, expanding on the OLE DB for Data Mining interface. That means you can generate your model using your tool and workbench of choice, whether from Angoss, Megaputer, NCR, SAS, or SPSS, export it in PMML, and import it into the DBMS for execution.
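The export, import, and query-time scoring flow can be sketched in a few lines of Python. This is an analogy, not any vendor's implementation: SQLite stands in for the DBMS, a user-defined SQL function stands in for in-database model execution, and the XML is a simplified, namespace-free fragment in the style of a PMML regression model (a real PMML file also carries a namespace, a header, and a data dictionary). The model, its coefficients, the `predict_spend` function, and the `customers` table are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified PMML-style fragment; the model and its numbers are invented.
PMML_FRAGMENT = """
<PMML version="4.4">
  <RegressionModel functionName="regression" modelName="spend_model">
    <RegressionTable intercept="10.0">
      <NumericPredictor name="visits" coefficient="2.5"/>
      <NumericPredictor name="tenure_years" coefficient="4.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """'Import' step: read the intercept and coefficients from the XML."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefs


intercept, coefs = load_linear_model(PMML_FRAGMENT)


def predict_spend(visits, tenure_years):
    """Apply the imported linear model to one row."""
    return (
        intercept
        + coefs["visits"] * visits
        + coefs["tenure_years"] * tenure_years
    )


conn = sqlite3.connect(":memory:")
# Register the scorer so ordinary SQL can call it, mimicking in-DBMS scoring.
conn.create_function("predict_spend", 2, predict_spend)
conn.execute("CREATE TABLE customers (name TEXT, visits REAL, tenure_years REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", 4, 2), ("Bo", 10, 0.5)],
)
rows = conn.execute(
    "SELECT name, predict_spend(visits, tenure_years) AS predicted "
    "FROM customers ORDER BY name"
).fetchall()
print(rows)
```

The point of the design is that the analytics consumer writes only the SELECT statement. The model arrives as a document produced elsewhere, and scoring looks like any other SQL function call.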
Leading vendors — IBM, Microsoft, Oracle, SAS, SPSS, and others — are providing Java (or the rough equivalent, OLE DB for Data Mining, in Microsoft's case) programmatic interfaces for data mining functions. Analytic tool suppliers and corporate developers can use these APIs to embed model execution in their own applications.
On the OLAP side, leading vendors such as Hyperion and MicroStrategy are working hard to make data mining models appear to be no more than, in effect, a dimension in a data cube. Hyperion recently launched a version of its Essbase engine with a set of predictive algorithms tailored to marketing and performance management problems and designed to produce good results from traditional low-dimensionality cubes. And a forthcoming MicroStrategy release will exploit DBMS-embedded models to deliver capabilities to its OLAP and reporting users. As Paul Turner, Hyperion's director of platform product marketing, puts it, "The main goal in adding data mining to the BI platform is to take mining out of the back room... [so] it can be applied to a much more general set of business problems."
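What "a model as a dimension" might look like can be shown with a small Python sketch. A scoring function assigns each fact row a predicted band, and the rollup then treats that band exactly like a conventional dimension such as region. The scoring rule below is a hand-written stand-in for a mined model, and the rule, the band thresholds, and the fact rows are all assumptions made for illustration; this does not describe how Essbase or MicroStrategy work internally.

```python
from collections import defaultdict


def churn_risk_band(months_inactive, support_tickets):
    """Hypothetical stand-in for a mined model: returns a risk band label."""
    points = 3 * months_inactive + 2 * support_tickets
    if points >= 12:
        return "high"
    if points >= 6:
        return "medium"
    return "low"


# Made-up fact rows: (region, months_inactive, support_tickets, revenue)
FACTS = [
    ("East", 0, 1, 1200.0),
    ("East", 5, 2, 300.0),
    ("West", 1, 1, 900.0),
    ("West", 3, 0, 450.0),
    ("West", 0, 0, 1500.0),
]


def rollup(facts):
    """Aggregate revenue by (region, predicted risk band)."""
    cube = defaultdict(float)
    for region, months, tickets, revenue in facts:
        band = churn_risk_band(months, tickets)  # model output used as a dimension
        cube[(region, band)] += revenue
    return dict(cube)


cube = rollup(FACTS)
for (region, band), revenue in sorted(cube.items()):
    print(f"{region:5} {band:7} {revenue:8.2f}")
```

An OLAP user can then slice revenue by risk band without knowing or caring that the band came from a predictive model rather than from a column in the source data.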
Microsoft's 2005 release will be a turning point. That's not for technical reasons, although that release is slated to quadruple the number of supported algorithms and extend them to include sequence and time-series analysis. Microsoft's goal, according to Amir Netz, product unit manager for Analysis Services, is the same as that of other vendors: "to expand the [analytics] market beyond business analysts." But as it did with Analysis Services, Microsoft will capture market share by integrating advanced analytics into its mainstream product offerings and driving down prices.
Data mining's new accessibility is broadening the technology's appeal. It will prove a significant step in boosting enterprise analytic intelligence.
Seth Grimes heads Alta Plana Corp., a Washington, D.C.-based consultancy specializing in business analytics and demographic and economic statistics.
- Angoss Software: www.angoss.com
- CRISP-DM data-mining methodology: crisp-dm.org
- Hyperion: www.hyperion.com
- IBM: www.ibm.com
- Java Data Mining API: jcp.org/en/jsr/detail?id=73
- Megaputer Intelligence: www.megaputer.com
- MicroStrategy: www.microstrategy.com
- Oracle: www.oracle.com
- Predictive Model Markup Language Cover Pages: xml.coverpages.org/pmml.html
- SAS: www.sas.com
- SEMMA data mining methodology: www.sas.com/technologies/analytics/datamining/miner/semma.html
- SPSS: www.spss.com
- Microsoft: www.microsoft.com