Understand both business and technology
By Herb Edelstein
Issue Date: Jan. 8, 1996
The key to data mining success is understanding both the business problem
and the technology. All too often, people apply a data mining tool
blindly to a database and expect to get a usable result. Without an understanding
of the problem domain, it's very easy to misuse a data mining product.
For example, many products require you to use portions of the available
data. These portions may be a subset of the rows (a sample of the data),
a selection of columns (variables), or both. You can only properly choose
this subset if you understand your problem and data. Some products will
try to sample the data automatically; the cautious user must understand whether
the basis the product uses for row selection will give the desired result.
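To see why the basis for row selection matters, consider the following minimal
Python sketch. It is an illustration only, not any product's method: the table,
the "responded" column, and the function names are hypothetical. It compares
taking the first rows of a table with drawing the same fraction from each group.

```python
import random
from collections import Counter

def first_n(rows, n):
    """Naive 'sample': the first n rows in storage order."""
    return rows[:n]

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from each group defined by `key`."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical table: 900 non-responders stored ahead of 100 responders.
rows = [{"id": i, "responded": i >= 900} for i in range(1000)]

head_counts = Counter(r["responded"] for r in first_n(rows, 100))
strat_counts = Counter(r["responded"] for r in stratified_sample(rows, "responded", 0.1))

print(head_counts)   # the first 100 rows contain no responders at all
print(strat_counts)  # 90 non-responders and 10 responders
```

If the table happens to be stored in an order related to the outcome, the first
kind of sample hides the very pattern you are looking for.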
Furthermore, not every approach or algorithm is appropriate for every problem.
Understanding the limitations of algorithms and how they use data is essential
in interpreting results. A result that is too close to perfect may indicate that
the pattern you are searching for is already coded into the data in a disguised
format: for example, a variable that depends on a value calculated from other
parts of the database.
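One rough way to look for such a disguised variable is to ask, for each column,
how well its value alone determines the outcome. The sketch below is an
assumption-laden illustration: the column names, the 0.99 threshold, and the
limit on distinct values (meant to skip identifier-like columns) are all
hypothetical choices.

```python
from collections import Counter, defaultdict

def purity(rows, column, target):
    """Fraction of rows whose target equals the majority target for their column value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    matched = sum(max(counts.values()) for counts in by_value.values())
    return matched / len(rows)

def leak_suspects(rows, target, threshold=0.99, max_distinct=20):
    """Columns that predict the target almost perfectly on their own."""
    suspects = []
    for col in (c for c in rows[0] if c != target):
        distinct = len({row[col] for row in rows})
        if distinct <= max_distinct and purity(rows, col, target) >= threshold:
            suspects.append(col)
    return suspects

# Hypothetical data in which "account_status" was derived from the outcome itself.
rows = [
    {"region": "N", "account_status": "closed", "churned": True},
    {"region": "N", "account_status": "open", "churned": False},
    {"region": "S", "account_status": "closed", "churned": True},
    {"region": "S", "account_status": "open", "churned": False},
    {"region": "N", "account_status": "open", "churned": False},
    {"region": "S", "account_status": "closed", "churned": True},
]

suspects = leak_suspects(rows, "churned")
print(suspects)
```

A flagged column is not proof of a problem. It is a prompt to find out how that
column was calculated.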
Perhaps the biggest problem in data warehousing in general and data mining
in particular is the quality of the data. One of the earliest principles
of data processing applies here: garbage in, garbage out. It is absolutely
critical to ensure that the data is as clean as possible and has as few
missing values as possible. Because there inevitably will be a certain amount
of bad and missing data in the data warehouse, you will need to understand
how this can affect results.
If your model is highly sensitive to a particular variable, you should make
sure that small amounts of incorrect or missing data in that variable haven't
yielded skewed results. You must continually monitor data quality as you
add data to your warehouse.
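A simple monitoring step is to measure the share of missing values in each
variable every time data is loaded. The following sketch assumes rows are held
as dictionaries and that None or an empty string means "missing"; the column
names and the 20% alert level are hypothetical.

```python
def missing_rates(rows, columns, is_missing=lambda v: v is None or v == ""):
    """Fraction of rows in which each column is missing."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if is_missing(row.get(col))) / total
        for col in columns
    }

# Hypothetical load of four customer rows.
rows = [
    {"age": 34, "income": 52000, "state": "CA"},
    {"age": 51, "income": None, "state": "NY"},
    {"age": 29, "income": None, "state": ""},
    {"age": 45, "income": 61000, "state": "TX"},
]

rates = missing_rates(rows, ["age", "income", "state"])

# Flag variables whose missing share exceeds an assumed alert level of 20%.
alerts = [col for col, rate in rates.items() if rate > 0.20]
print(rates)
print(alerts)
```

If a flagged variable is one your model leans on heavily, rerun the model with
and without the affected rows and compare the results.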
A formal examination of your data can help you build your model and improve
its quality. The examination can range from a series of queries to some
preliminary data mining.
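As an illustration of the query end of that range, the sketch below uses
Python's built-in sqlite3 module on a small in-memory table. The table, its
columns, and the values are hypothetical; the same queries could be run against
any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 34, "CA"), (2, 51, "ca"), (3, None, "NY"), (4, 230, "NY")],
)

# Row count, non-null count, and range of a numeric column in one query.
total, non_null, min_age, max_age = conn.execute(
    "SELECT COUNT(*), COUNT(age), MIN(age), MAX(age) FROM customers"
).fetchone()

# Distinct codes expose inconsistent encodings such as "CA" and "ca".
states = [row[0] for row in conn.execute("SELECT DISTINCT state FROM customers")]

print(total, non_null, min_age, max_age)
print(sorted(states))
conn.close()
```

Here the queries turn up one missing age, an implausible maximum age of 230,
and two spellings of the same state code, all before any model is built.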
No matter how good your model is, it will likely need to change over time. A
classification scheme that works in an era of 3% inflation may not be as
effective in an era of 6% inflation.
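One way to notice when a model has stopped fitting current conditions is to
track its accuracy period by period against a baseline. The sketch below is a
minimal illustration; the period labels, the records, and the tolerance of 0.1
are hypothetical.

```python
def accuracy_by_period(records):
    """records: iterable of (period, predicted, actual) tuples."""
    totals, hits = {}, {}
    for period, predicted, actual in records:
        totals[period] = totals.get(period, 0) + 1
        hits[period] = hits.get(period, 0) + (predicted == actual)
    return {period: hits[period] / totals[period] for period in totals}

def degraded_periods(accuracy, baseline_period, tolerance=0.1):
    """Periods whose accuracy fell more than `tolerance` below the baseline period."""
    baseline = accuracy[baseline_period]
    return [p for p, acc in accuracy.items() if acc < baseline - tolerance]

# Hypothetical scoring log: the model was built on data from the first period.
records = [
    ("1995Q1", "good", "good"), ("1995Q1", "bad", "bad"),
    ("1995Q1", "good", "good"), ("1995Q1", "bad", "bad"),
    ("1995Q4", "good", "bad"), ("1995Q4", "good", "good"),
    ("1995Q4", "bad", "good"), ("1995Q4", "bad", "bad"),
]

accuracy = accuracy_by_period(records)
stale = degraded_periods(accuracy, "1995Q1")
print(accuracy)
print(stale)
```

A period that shows up in the result is a signal to re-examine the model
against current data, not a verdict on why it slipped.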