Technology How To: Mining Data Warehouses
By Herb Edelstein
Issue Date: Jan. 8, 1996
Data mining finds answers to questions about your business that you haven't thought to ask. It discovers information within data warehouses that queries and reports can't effectively reveal. The potential payoffs from data mining are enormous if you pick the right tools and use them effectively. These applications can become the foundation of your organization's business strategies: determining credit risk, detecting fraud, managing product warranties, purchasing stock for retail stores, and defining new telecommunications products and services.
Traditional database queries are designed to supply answers to simple questions such as "What were my sales in January 1995 in the Northeast region?" Multidimensional analysis, often called online analytical processing (OLAP), lets users pose much more complex queries, such as comparing sales relative to plan by quarter and region for the prior two years. But in both cases, the results are merely extracted values or aggregations of values.
Data mining reaches much deeper into databases. Data mining tools find patterns in the data and infer rules from them. Those patterns and rules can be used to guide decision-making and forecast the effect of those decisions. And data mining can speed analysis by focusing attention on the most important variables.
You might find these patterns with a series of queries against the data. But data mining lets users explore a much wider range of possibilities than even the most sophisticated set of queries.
Data mining is taking off for several reasons. Organizations are gathering more data about their businesses. The enormous drop in storage prices has made it feasible to keep huge amounts of data online. Some of this data comes from traditional online transaction processing (OLTP) systems, but much of it is the result of systems put in place in recent years that capture all details of a transaction to help companies better understand what customers really want and do, as opposed to what they say.
For example, some grocery chains are encouraging customers to sign up for cards that give them discounts when the card is presented and scanned at check-out. The store can tell through its scanning system not only what is in each market basket, but who purchased it.
The dramatic drop in the cost/performance ratio of computer systems has enabled many organizations to start applying the complex algorithms that are used in data mining techniques. While many of the basic ideas behind these algorithms have been around for decades, the surge in cost-effective computing in the '80s and the prevalence of data resulted in a host of new algorithms and approaches that are the basis for many of today's products.
The rise of data warehousing also has greatly reduced the barrier to data mining. In the past, it was often necessary to gather the data, cleanse it, and merge it. Now, in many cases, that already has happened and the data is sitting in a data warehouse shouting, "Use me! Use me!"
There are five common types of information that can be yielded by data mining: associations, sequences, classifications, clusters, and forecasting. Associations occur when occurrences are linked in a single event. For example, a study of supermarket baskets might reveal that when corn chips are purchased, 65% of the time cola is also purchased, unless there is a promotion, in which case cola is purchased 85% of the time. Knowing this, managers can evaluate the profitability of a promotion.
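The association figure in the example is what is now usually called rule confidence, and it can be computed from basket data directly. The sketch below is illustrative only: the baskets, item names, and `confidence` helper are made up for demonstration, not drawn from any real study.

```python
# Hypothetical market-basket data: each basket is the set of items
# purchased together in a single checkout (illustrative values).
baskets = [
    {"corn chips", "cola"},
    {"corn chips", "cola", "salsa"},
    {"corn chips", "bread"},
    {"cola", "bread"},
    {"corn chips", "cola"},
]

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain
    `consequent` -- the 65% figure in the text is such a value."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

print(confidence(baskets, "corn chips", "cola"))  # 3 of 4 chip baskets -> 0.75
```

A real tool would enumerate many candidate item pairs and report only those whose confidence and frequency clear user-set thresholds.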
In sequences, events are linked over time. If a house is bought, then 45% of the time a new oven will be bought within one month and 60% of the time a new refrigerator will be bought within two weeks.
Classification is probably the most common data mining activity today. It recognizes patterns that describe the group to which an item belongs. It does this by examining existing items that already have been classified and inferring a set of rules.
A problem common to many businesses is the loss of steady customers. In the credit-card business this is called attrition, in the cellular phone business it's churn, and in the pharmaceutical business, it's called defection.
Classification can help you discover the characteristics of customers who are likely to leave and provide a model that can be used to predict who they are. It can also help you determine which kinds of promotions have been effective in keeping which types of customers, so that you spend only as much money as necessary to retain a customer.
Clustering is related to classification, but differs in that no groups have yet been defined. Using clustering, the data mining tool discovers different groupings within the data. This can be applied to problems as diverse as detecting defects in manufacturing or finding affinity groups for bank cards.
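One classic way a tool can discover such undefined groupings is the k-means algorithm. The sketch below is a minimal, assumption-laden illustration: the data points, the naive seeding from the first k points, and the fixed iteration count are all simplifications a real product would handle more carefully.

```python
def kmeans(points, k, iters=10):
    """Minimal k-means: discover k groupings in unlabeled 2-D data,
    much as a clustering tool finds groups no one has predefined."""
    centers = list(points[:k])              # naive seeding; real tools do better
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for px, py in points:
            i = min(range(k),
                    key=lambda i: (px - centers[i][0]) ** 2 + (py - centers[i][1]) ** 2)
            clusters[i].append((px, py))
        # Move each center to the mean of its assigned points.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Two obvious groups, e.g. low spenders and high spenders (made-up data).
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9), (9, 10), (10, 9), (10, 10)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # -> [4, 4]
```

The tool's output is the groupings themselves; interpreting them (defect types, affinity groups) is left to the analyst.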
All of these applications may involve predictions, such as whether a customer will renew a subscription. The fifth application type, forecasting, is a different form of prediction. It estimates the future value of continuous variables-like sales figures-based on patterns within the data.
Four main types of tools are used in data mining: neural networks, decision trees, rule induction, and data visualization. Some tools are based on combinations of these methods.
Neural networks are, essentially, collections of connected nodes with inputs, outputs, and processing at each node. Between the visible input layer and output layer may be a number of hidden processing layers.
The network is capable of learning: it is trained on a set of data in which each set of inputs has a known output. Each case in the training set is run through the network and compared with the known outcome; if the result differs, a correction is calculated and applied to the processing in the network's nodes. These steps are repeated until a stopping condition is reached, such as the corrections falling below a certain amount.
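The correct-and-repeat loop described above can be sketched with a single sigmoid neuron trained on a toy problem (logical OR). Everything here is an illustrative assumption: the learning rate, stopping threshold, and training set are made up, and a real network would have many nodes and hidden layers.

```python
import math

def sigmoid(x):
    # Smooth squashing function: maps any input to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Training set: inputs paired with known outputs (logical OR, a toy case).
training_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, rate = [0.0, 0.0], 0.0, 0.5

for epoch in range(10000):
    max_correction = 0.0
    for (x1, x2), target in training_set:
        out = sigmoid(weights[0] * x1 + weights[1] * x2 + bias)
        err = target - out                  # compare with the known outcome
        delta = err * out * (1 - out)       # gradient through the sigmoid
        weights[0] += rate * delta * x1     # apply the correction
        weights[1] += rate * delta * x2
        bias += rate * delta
        max_correction = max(max_correction, abs(rate * delta))
    if max_correction < 1e-4:               # stopping condition from the text
        break

print(round(sigmoid(weights[0] + weights[1] + bias)))  # input (1, 1) -> 1
```

After training, the weights encode the pattern, but, as the next paragraphs note, nothing in them explains the reasoning in business terms.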
Neural networks are an opaque process, meaning the resulting model has no clear interpretation. It usually is applied without an understanding of the reasoning behind its results.
Some algorithms can translate a neural net model into a set of rules that can help you understand what the neural net is doing. Some proprietary neural net products have this capability.
Many neural net products are used beyond the business information arena. Neural networks frequently are applied to more general pattern recognition problems such as handwriting recognition and interpretation of electrocardiograms.
Decision trees divide the data into groups based on values of the variables. They use a methodology that resembles the game of 20 Questions. The result is a hierarchy of if-then statements that classify the data.
For example: If a customer has made 25% fewer cellular calls each month than the preceding month, for six months, then there is a 60% probability that the customer is going to drop the service.
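The churn rule above can be written as exactly the kind of if-then test a decision tree produces. This sketch is purely illustrative: the function name, the baseline probability, and the threshold arithmetic are assumptions added for demonstration.

```python
def churn_risk(monthly_calls):
    """If the customer made 25% fewer calls each month than the month
    before, for six months, predict a 60% chance of dropping the service.
    The 10% baseline is an assumed figure, not from the article."""
    declines = sum(
        1 for prev, cur in zip(monthly_calls, monthly_calls[1:])
        if cur <= 0.75 * prev               # 25% fewer than the prior month
    )
    if declines >= 6:
        return 0.60
    return 0.10

print(churn_risk([100, 75, 56, 42, 31, 23, 17]))  # six straight declines -> 0.6
```

A decision-tree tool would infer the threshold (25%) and the probability (60%) from historical data rather than have an analyst hard-code them.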
There has been a surge of interest in decision tree-based products, primarily because they are faster than neural networks for many business problems and easier for users to understand. Pilot Software Inc. in Cambridge, Mass., is adding a decision tree-based data mining tool to its Lightship multidimensional engine.
Decision trees aren't foolproof, however, and may not work with some types of data. Some decision trees have problems handling continuous variables, like age or sales, and require that they be grouped into ranges. The way a range is selected can inadvertently hide patterns. For instance, if age is broken into a 25- to 34-year-old group, the fact that there is a significant break at 30 may be concealed. Information Harvester from Information Harvesting Corp. in Cambridge, Mass., avoids this problem by assigning values to groups in a fuzzy way: each instance of the same value may be assigned to a different group.
A set of if-then statements can be every bit as obscure as a neural net, particularly if the condition list is long and complex. Rule induction creates non-hierarchical sets of conditions, which may overlap. For example, Idis from Information Discovery Inc. in Hermosa Beach, Calif., does rule induction by generating partial decision trees, and uses statistical techniques to choose which apply to the input data.
Mix and Match
Some vendors combine these approaches. A product due in the middle of 1996 from DataMind Corp. in Redwood City, Calif., will reportedly combine the features of neural networks and decision trees in an attempt to build a more accurate model and do it faster.
The final type of data mining tool is data visualization software. In some ways, data visualization is not really a data mining tool, because it only presents a picture for users to see rather than automating the process. But visual representations of as many as four variables in a single picture present an enormous amount of information in a very concise fashion. In effect, information reaches the user at very high bandwidth, often making groups stand out as peaks or valleys.
Several vendors offer a suite of products in recognition that different problems may be best served by different approaches. For example, Darwin, coming in the first half of 1996 from Thinking Machines Corp. in Bedford, Mass., will offer not only the ability to develop models with a neural net or a decision tree, but also visualization and memory-based reasoning-a classification method that matches cases to similar records whose outcome is already known. It also has a genetic algorithm that can be used for optimizing models.
IBM has also been very active in data mining and has published a considerable amount of research done at its labs. Much of this research is coming to market as a data mining tool kit that addresses four of the five common data mining application types: classification, clustering, sequencing, and associations.
Chopping It Up
One potentially serious problem in data mining is the necessity to subset data for performance reasons. You may have to trade off the number of rows in your sample against the number of variables you evaluate to build your model.
For example, the performance of both Idis and Information Harvester scales approximately linearly as the number of rows increases, but deteriorates as the number of variables increases.
A group of products from Cross/Z International Inc. in Mitchell Field, N.Y., called the Fractal Data Mining System, can mine sets of data whose size is virtually independent of the number of rows in the database. However, it is limited to about 12 variables with a total combination of values less than 1 billion. The company's products do this through a two-step representation and compression process.
First, the data warehouse is distilled into a file of query results by asking all possible combinations of questions for all variables of interest. For example, suppose you had a database with columns of sex, state of residence, age, and income, with one row for each individual. The system asks questions like, "How many males in Alaska are age 26?" and also looks for the counts with age 27, 28, and so on. It changes the value of every variable until all values have been exhausted. Values like income in this example are used as the basis for metrics, which are kept along with the counts as summaries (e.g., sum or average) in answer to the questions.
The resultant database is typically much smaller than the original data set, because the answers to the questions usually take up much less space than the data itself. In this example, there are only about 20,000 combinations of age, state, and sex, so regardless of whether there were 100,000 rows or 100 million rows, the size of the data set would be about the same.
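The distillation step described above amounts to rolling every row up into counts and metric sums keyed by the combination of variable values. The sketch below is a simplified illustration, not Cross/Z's method: the rows are invented, and it populates only the combinations actually present rather than enumerating every possible one.

```python
from collections import defaultdict

# Made-up individual rows: (sex, state, age, income).
rows = [
    ("M", "AK", 26, 41000),
    ("M", "AK", 26, 52000),
    ("F", "AK", 26, 48000),
    ("M", "CA", 27, 61000),
]

# Roll rows up by (sex, state, age): the summary's size depends on the
# number of value combinations, not on the number of rows.
summary = defaultdict(lambda: [0, 0.0])   # key -> [count, income sum]
for sex, state, age, income in rows:
    cell = summary[(sex, state, age)]
    cell[0] += 1
    cell[1] += income

count, total = summary[("M", "AK", 26)]
print(count, total / count)               # "How many males in Alaska are 26?"
```

Whether the warehouse held a hundred thousand rows or a hundred million, this summary would have at most one cell per value combination, which is what makes the subsequent compression and querying so cheap.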
On top of this, fractal compression is applied to further reduce file size. This compressed file is queried directly. The results can be used on a PC or even stored on a diskette.
It is important to remember that data mining is not magic. Buying a $99 neural net program and throwing it against a terabyte data warehouse is not likely to produce any useful results. It will take forever to get an answer, and that answer will probably be worthless. But when applied properly, data mining can produce the return-on-investment from your data warehouse that you've been waiting for.
See related sidebar: "Successful Mining"
Herb Edelstein is a partner in Euclid Associates, a data warehousing and data mining consulting firm in Potomac, Md. He can be reached at 73377,email@example.com