Data mining finds answers to questions about your business that you haven't
thought to ask. It discovers information within data warehouses that queries
and reports can't effectively reveal.
The potential payoffs from data mining are enormous if you pick the right
tools and use them effectively. These applications can become the foundation
of your organization's business strategies: determining credit risk, detecting
fraud, managing product warranties, purchasing stock for retail stores,
and defining new telecommunications products and services.
Traditional database queries are designed to supply answers to simple questions
such as "What were my sales in January 1995 in the Northeast region?"
Multidimensional analysis, often called online analytical processing (OLAP),
lets users run much more complex queries, such as comparing sales relative
to plan by quarter and region for the prior two years. But in both cases,
the results are merely extracted values or an aggregation of values.
Data mining reaches much deeper into databases. Data mining tools find patterns
in the data and infer rules from them. Those patterns and rules can be used
to guide decision-making and forecast the effect of those decisions. And
data mining can speed analysis by focusing attention on the most important
variables.
You might find these patterns with a series of queries against the data.
But data mining lets users explore a much wider range of possibilities than
even the most sophisticated set of queries.
Data mining is taking off for several reasons. Organizations are gathering
more data about their businesses. The enormous drop in storage prices has
made it feasible to keep huge amounts of data online. Some of this data
comes from traditional online transaction processing (OLTP) systems, but
much of it is the result of systems put in place in recent years that capture
all details of a transaction to help companies better understand what customers
really want and do, as opposed to what they say.
For example, some grocery chains are encouraging customers to sign up for
cards that give them discounts when the card is presented and scanned at
check-out. The store can tell through its scanning system not only what
is in each market basket, but who purchased it.
The dramatic drop in the cost/performance ratio of computer systems has
enabled many organizations to start applying the complex algorithms that
are used in data mining techniques. While many of the basic ideas behind
these algorithms have been around for decades, the surge in cost-effective
computing in the '80s and the prevalence of data resulted in a host of new
algorithms and approaches that are the basis for many of today's products.
The rise of data warehousing also has greatly reduced the barrier to data
mining. In the past, it was often necessary to gather the data, cleanse
it, and merge it. Now, in many cases, that already has happened and the
data is sitting in a data warehouse shouting, "Use me! Use me!"
Information, Please
There are five common types of information that can be yielded by data mining:
associations, sequences, classifications, clusters, and forecasting. Associations
happen when occurrences are linked in a single event. For example, a study
of supermarket baskets might reveal that when corn chips are purchased,
65% of the time cola is also purchased, unless there is a promotion, in
which case cola is purchased 85% of the time. Knowing this, managers can
evaluate the profitability of a promotion.
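The percentage in an association like this is simply a conditional frequency. A minimal sketch in Python, assuming a handful of invented baskets (the items and counts below are illustrative, not data from any study), might compute it this way:

```python
# Toy market baskets; the items and counts are invented for illustration.
baskets = [
    {"corn chips", "cola"},
    {"corn chips", "cola", "salsa"},
    {"corn chips", "salsa"},
    {"cola", "bread"},
    {"corn chips", "cola"},
    {"bread", "milk"},
]

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

chips_to_cola = confidence(baskets, "corn chips", "cola")
print(f"corn chips -> cola: {chips_to_cola:.0%}")
```

Running the same function separately on baskets from promotion and non-promotion periods would give the two figures a manager needs to compare.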
In sequences, events are linked over time. If a house is bought, then 45%
of the time a new oven will be bought within one month and 60% of the time
a new refrigerator will be bought within two weeks.
Classification is probably the most common data mining activity today. It
recognizes patterns that describe the group to which an item belongs. It
does this by examining existing items that already have been classified
and inferring a set of rules.
A problem common to many businesses is the loss of steady customers. In
the credit-card business this is called attrition, in the cellular phone
business it's churn, and in the pharmaceutical business, it's called defection.
Classification can help you discover the characteristics of customers
who are likely to leave and provide a model that can be used to predict
who they are. It can also help you determine which kinds of promotions have
been effective in keeping which types of customers, so that you spend only
as much money as necessary to retain a customer.
Clustering is related to classification, but differs in that no groups have
yet been defined. Using clustering, the data mining tool discovers different
groupings within the data. This can be applied to problems as diverse as
detecting defects in manufacturing or finding affinity groups for bank cards.
All of these applications may involve predictions, such as whether a customer
will renew a subscription. The fifth application type, forecasting, is a
different form of prediction. It estimates the future value of continuous
variables, such as sales figures, based on patterns within the data.
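As a rough illustration of forecasting a continuous variable, the sketch below fits a straight-line trend to a few invented monthly sales figures and extends it one month. A least-squares line is only one simple choice made here for illustration; commercial tools may use quite different methods.

```python
def fit_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Needs at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    return slope, mean_y - slope * mean_x

monthly_sales = [100.0, 110.0, 120.0, 130.0]  # invented figures
slope, intercept = fit_trend(monthly_sales)
next_month = intercept + slope * len(monthly_sales)
print(f"forecast for next month: {next_month:.1f}")
```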
Four main types of tools are used in data mining: neural networks, decision
trees, rule induction, and data visualization. Some tools are based on combinations
of these methods.
Neural networks are, essentially, collections of connected nodes with inputs,
outputs, and processing at each node. Between the visible input layer and
output layer may be a number of hidden processing layers.
The network is capable of learning; it has a training set of data for which
the inputs produce a known set of outputs. Each case in the training set
is compared with the known outcome; if it differs, a correction is calculated
and applied to the processing in the nodes in the network. These steps are
repeated until a stopping condition, such as corrections being less than
a certain amount, is reached.
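The compare, correct, and repeat cycle can be sketched with a single linear node. This is a deliberately reduced illustration, not a full network with hidden layers: the training cases, learning rate, and tolerance are all assumptions chosen so the example runs quickly.

```python
def train_single_node(cases, learning_rate=0.1, tolerance=1e-6, max_epochs=10_000):
    """Train one linear node with the delta rule; return (weights, bias, epochs_used)."""
    n_inputs = len(cases[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        largest_correction = 0.0
        for inputs, known_output in cases:
            # Compare the node's output with the known outcome.
            predicted = bias + sum(w * x for w, x in zip(weights, inputs))
            error = known_output - predicted
            # Apply a correction proportional to the error and to each input.
            for i, x in enumerate(inputs):
                correction = learning_rate * error * x
                weights[i] += correction
                largest_correction = max(largest_correction, abs(correction))
            bias_correction = learning_rate * error
            bias += bias_correction
            largest_correction = max(largest_correction, abs(bias_correction))
        # Stopping condition: every correction in this pass was tiny.
        if largest_correction < tolerance:
            break
    return weights, bias, epoch

# Invented training set generated from output = 2*x1 - 1*x2 + 0.5.
training_set = [
    ((0.0, 0.0), 0.5),
    ((0.0, 1.0), -0.5),
    ((1.0, 0.0), 2.5),
    ((1.0, 1.0), 1.5),
]
weights, bias, epochs_used = train_single_node(training_set)
print(weights, bias, epochs_used)
```

The `max_epochs` cap is a second stopping condition, added here so the loop ends even if the corrections never become small.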
Neural networks are opaque: the resulting model doesn't have a clear
interpretation, and it usually is applied without understanding the
reasoning behind its results.
Some algorithms can translate a neural net model into a set of rules that
can help you understand what the neural net is doing. Some proprietary
neural net products have this capability.
Many neural net products are used beyond the business information arena.
Neural networks frequently are applied to more general pattern recognition
problems such as handwriting recognition and interpretation of electrocardiograms.
Decision trees divide the data into groups based on values of the variables.
They use a methodology that resembles the game of 20 Questions. The result
is a hierarchy of if-then statements that classify the data.
For example: If a customer has made 25% fewer cellular calls each month
than the preceding month, for six months, then there is a 60% probability
that the customer is going to drop the service.
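Written as code, one branch of such a tree is just a test that either fires or does not. The sketch below encodes the cellular example; reading "25% fewer" as "at least 25% fewer" and returning `None` when the rule does not apply are assumptions made for illustration.

```python
def churn_probability(monthly_calls, drop=0.25, months=6, probability=0.60):
    """Return the example's 60% probability if the rule fires, else None.

    `monthly_calls` lists call counts, oldest first. The rule needs
    months + 1 values so that each of the last `months` months can be
    compared with the month before it.
    """
    if len(monthly_calls) < months + 1:
        return None
    recent = monthly_calls[-(months + 1):]
    for previous, current in zip(recent, recent[1:]):
        if current > previous * (1 - drop):
            return None  # this month did not fall by at least 25%
    return probability

declining = [400, 300, 225, 160, 120, 90, 60]
steady = [400, 400, 400, 400, 400, 400, 400]
print(churn_probability(declining), churn_probability(steady))
```

A real tree would chain many such tests, each one narrowing the group of customers that reaches the next question.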
There has been a surge of interest in decision tree-based products, primarily
because they are faster than neural networks for many business problems
and easier for users to understand. Pilot Software Inc. in Cambridge, Mass.,
is adding a decision tree-based data mining tool to its Lightship multidimensional
engine.
Decision trees aren't foolproof, however, and may not work with some types
of data. Some decision trees have problems handling continuous sets of data,
like age or sales, and require that they be grouped into ranges. The way
a range is selected can inadvertently hide patterns. For instance, if age
is broken into a 25- to 34-year-old group, the fact that there is a significant
break at 30 may be concealed. Information Harvester from Information Harvesting
Corp. in Cambridge, Mass., avoids this problem by assigning values to groups
in a fuzzy way: each instance of the same value may be assigned to a different
group.
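A few invented numbers show how a range can hide a break. In the sketch below, the response counts by age are hypothetical; customers under 30 respond at one rate and those 30 and over at another, yet the single 25-to-34 group reports only the blended figure.

```python
# Hypothetical (responders, customers) counts for each age. Invented data.
by_age = {age: (10, 50) for age in range(25, 30)}
by_age.update({age: (30, 50) for age in range(30, 35)})

def response_rate(ages):
    """Overall response rate across the given ages."""
    responders = sum(by_age[a][0] for a in ages)
    customers = sum(by_age[a][1] for a in ages)
    return responders / customers

whole_group = response_rate(range(25, 35))  # one 25-to-34 bin
under_30 = response_rate(range(25, 30))
from_30 = response_rate(range(30, 35))
print(f"25-34: {whole_group:.0%}, 25-29: {under_30:.0%}, 30-34: {from_30:.0%}")
```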
A set of if-then statements can be every bit as obscure as a neural net,
particularly if the condition list is long and complex. Rule induction creates
non-hierarchical sets of conditions, which may overlap. For example, Idis
from Information Discovery Inc. in Hermosa Beach, Calif., does rule induction
by generating partial decision trees, and uses statistical techniques to
choose which apply to the input data.
Mix and Match
Some vendors combine these approaches. A product due in the middle of 1996
from DataMind Corp. in Redwood City, Calif., will reportedly combine the
features of neural networks and decision trees in an attempt to build a
more accurate model and do it faster.
The final type of data mining tool is data visualization software. In some
ways, data visualization is not really a data mining tool, because it only
presents a picture for users to see rather than automating the process.
But a visual representation of as many as four variables in a single
picture presents an enormous amount of information in a very concise fashion.
Essentially, this very wide bandwidth of information presentation to the
user will often make groups stand out as peaks or valleys.
Several vendors offer a suite of products in recognition that different
problems may be best served by different approaches. For example, Darwin,
coming in the first half of 1996 from Thinking Machines Corp. in Bedford,
Mass., will offer not only the ability to develop models with a neural net
or a decision tree, but also visualization and memory-based reasoning, a
classification method that matches cases to similar records whose outcome
is already known. It also has a genetic algorithm that can be used for optimizing
models.
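Memory-based reasoning, as described here, can be sketched as a nearest-neighbor lookup. The example below is a generic illustration and not Darwin's implementation; the two features (monthly calls, months as a customer), the records, and the use of plain Euclidean distance are all assumptions.

```python
import math

def nearest_known_outcome(case, known_cases):
    """Return the outcome of the known record closest to `case`."""
    closest = min(known_cases, key=lambda record: math.dist(case, record[0]))
    return closest[1]

# Invented records: ((monthly calls, months as a customer), known outcome).
known_cases = [
    ((10.0, 3.0), "dropped"),
    ((12.0, 5.0), "dropped"),
    ((80.0, 40.0), "stayed"),
    ((95.0, 36.0), "stayed"),
]
print(nearest_known_outcome((15.0, 4.0), known_cases))
print(nearest_known_outcome((90.0, 38.0), known_cases))
```

In practice the features would need to be put on comparable scales first, or the one with the largest numbers would dominate the distance.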
IBM has also been very active in data mining and has published a considerable
amount of research done at its labs. Much of this research is coming to
market as a data mining tool kit that addresses four of the five common
data mining application types: classification, clustering, sequencing, and
associations.
Chopping It Up
One potentially serious problem in data mining is the necessity to subset
data for performance reasons. You may have to trade off the number of rows
in your sample against the number of variables you evaluate to build your
model.
For example, the performance of both Idis and Information Harvester scales
approximately linearly as the number of rows increases, but deteriorates as
the number of variables increases.
A group of products from Cross/Z International Inc. in Mitchell Field, N.Y.,
called the Fractal Data Mining System, can mine sets of data whose size is virtually
independent of the number of rows in the database. However, it is limited
to about 12 variables with a total combination of values less than
1 billion. The company's products do this through a two-step representation
and compression process.
First, the data warehouse is distilled into a file of query results by asking
all possible combinations of questions for all variables of interest. For
example, suppose you had a database with columns of sex, state of residence,
age, and income, with one row for each individual. The system asks questions
like, "How many males in Alaska are age 26?" and also looks for
the counts with age 27, 28, and so on. It changes the value of every variable
until all values have been exhausted. Values like income in this example
are used as the basis for metrics, which are kept along with the counts
as summaries (e.g., sum or average) in answer to the questions.
The resultant database is typically much smaller than the original data
set, because the answers to the questions usually take up much less space
than the data itself. In this example, there are only about 20,000 combinations
of age, state, and sex, so regardless of whether there were 100,000 rows
or 100 million rows, the size of the data set would be about the same.
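The first step, distilling rows into counts and metrics, can be sketched as follows. This is a generic illustration of the idea and not the Cross/Z implementation: the rows are invented, and the fractal compression step is not shown.

```python
from collections import defaultdict

def summarize(rows):
    """Map each (sex, state, age) combination to [count, total_income]."""
    summary = defaultdict(lambda: [0, 0.0])
    for sex, state, age, income in rows:
        cell = summary[(sex, state, age)]
        cell[0] += 1        # answers "how many?"
        cell[1] += income   # metric kept alongside the count
    return dict(summary)

# Invented rows: (sex, state, age, income), one per individual.
rows = [
    ("M", "AK", 26, 41000.0),
    ("M", "AK", 26, 39000.0),
    ("F", "AK", 27, 52000.0),
    ("M", "NY", 26, 45000.0),
]
small = summarize(rows)
large = summarize(rows * 1000)  # 1,000 times as many rows

count, total_income = small[("M", "AK", 26)]
print(count, total_income / count)   # count and average income
print(len(small), len(large))        # same number of combinations
```

The summary built from 4,000 rows has exactly as many entries as the one built from four, because its size depends on the distinct combinations of values rather than on the number of rows.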
On top of this, fractal compression is applied to further reduce file size.
This compressed file is queried directly. The results can be used on a PC or
even stored on a diskette.
It is important to remember that data mining is not magic. Buying a $99
neural net program and throwing it against a terabyte data warehouse is not
likely to produce any useful results. It will take forever to get an answer,
and that answer will probably be worthless. But when applied properly, data
mining can produce the return-on-investment from your data warehouse that
you've been waiting for.
Herb Edelstein is a partner in Euclid Associates,
a data warehousing and data mining consulting firm in Potomac, Md. He can
be reached at 73377,1547@compuserve.com