Silk Purses Out of Sows' Ears
Let's suppose that you're doing everything right, that you have data quality adequate for your needs and suitable, managed methods, algorithms, and implementations. There may be additional steps you can take to improve accuracy.
I attended a very interesting talk on "Information Awareness: A Prospective Technical Assessment," presented by David Jensen last August at a conference of SIGKDD, the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. Jensen and coauthors Matthew Rattigan and Hannah Blau ("Information awareness: A prospective technical assessment," kdl.cs.umass.edu/papers/jensen-et-alkdd2003.pdf), all from the computer-science department at the University of Massachusetts, put forth the thesis that the inaccuracy of national-security programs such as CAPPS-II (an airline-passenger screening program that would score air travelers according to a statistical model designed to profile potential terrorists) can be reduced by reframing modeling assumptions and reworking analytic processes. Their prescriptions can surely be applied to many different kinds of problems. In this particular case, a rate of just a fraction of 1 percent false positives would mean many thousands of erroneously flagged travelers, gumming up security procedures and reducing the odds of catching actual bad guys.
The authors would start by assuming "relational" data sources "providing connections between individual data records" rather than "propositional" sources "in which each instance is characterized by a set of simple propositions (that is, age=32, gender=male)" where "each individual is assumed to be statistically independent of any other." That is, expect and exploit statistical patterns in your data. Such patterns (segments and clusters) may be formed by correlations over time, in demographic or other characteristics, or in outcome.
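The false-positive arithmetic can be made concrete with a minimal sketch in Python. The traveler count, the error rate, and the number of genuine threats below are illustrative assumptions only; none of them comes from Jensen's paper or from CAPPS-II itself.

```python
# Illustrative base-rate arithmetic. Every number here is an assumption.

def expected_false_positives(travelers: int, false_positive_rate: float) -> float:
    """Expected count of innocent travelers flagged at the given rate."""
    return travelers * false_positive_rate


def flag_precision(true_positives: float, false_positives: float) -> float:
    """Fraction of all flags that point at a genuine threat."""
    return true_positives / (true_positives + false_positives)


travelers = 1_000_000          # hypothetical number of travelers screened
false_positive_rate = 0.005    # hypothetical: half of 1 percent
true_positives = 10            # hypothetical: genuine threats correctly flagged

false_positives = expected_false_positives(travelers, false_positive_rate)
precision = flag_precision(true_positives, false_positives)

print(f"Erroneously flagged travelers: {false_positives:,.0f}")
print(f"Share of flags that are genuine: {precision:.2%}")
```

Under these assumed figures the false alarms outnumber the genuine hits by hundreds to one, which is why even a tiny error rate can swamp the people who must follow up on each flag.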
Secondly, the authors believe that using rankings rather than binary classifiers (such as decision variables), where the outputs are scores rather than overly simple true or false values, can also significantly reduce the error rates and boost accuracy. I interpret this prescription as also calling for weighting the reliability of contributing factors in creating derived indicators on which decisions are based. That is, move away from deterministic models that depict a complex world without nuance, in black and white.
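A minimal Python sketch of the difference between a binary classifier and a ranking follows. The records, scores, weights, and function names are all hypothetical, and the weighted score reflects my reading of the prescription rather than a method taken from the paper.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    name: str
    score: float  # hypothetical model output between 0 and 1


def weighted_score(factors: List[Tuple[float, float]]) -> float:
    """Combine (value, reliability_weight) pairs into one weighted average."""
    total_weight = sum(weight for _, weight in factors)
    return sum(value * weight for value, weight in factors) / total_weight


def binary_flags(records: List[Record], threshold: float) -> List[str]:
    """Binary classifier: everyone at or above the threshold is flagged alike."""
    return [r.name for r in records if r.score >= threshold]


def top_k(records: List[Record], k: int) -> List[str]:
    """Ranking: order by score and act on the k highest, keeping the nuance."""
    ranked = sorted(records, key=lambda r: r.score, reverse=True)
    return [r.name for r in ranked[:k]]


records = [
    Record("A", 0.51),
    Record("B", 0.97),
    Record("C", 0.49),
    Record("D", 0.88),
]

flagged = binary_flags(records, threshold=0.5)  # A is treated the same as B
ranked = top_k(records, k=2)                    # only the strongest cases
combined = weighted_score([(1.0, 3.0), (0.0, 1.0)])  # reliable factor dominates
```

The binary rule treats a borderline 0.51 exactly like a 0.97, while the ranking keeps that distinction available to whoever must decide how to spend limited follow-up effort.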
Lastly, they advocate use of multipass rather than single-pass inference. Anything worth doing is worth doing twice, so check your work, and do it differently the second (and third) time. You may tease additional information out of a system that refines results and therefore improves their accuracy.
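One way to picture multipass inference is as a chain of filters, each applying a different check to whatever survived the previous pass. The Python sketch below is my own simplification under that assumption, with hypothetical field names and thresholds; it is not the procedure described in the paper.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[Dict], float]


def multipass_filter(records: List[Dict], passes: List[Tuple[Scorer, float]]) -> List[Dict]:
    """Run each (scorer, threshold) pass only on survivors of the prior pass."""
    survivors = list(records)
    for score_fn, threshold in passes:
        survivors = [r for r in survivors if score_fn(r) >= threshold]
    return survivors


def coarse_score(record: Dict) -> float:
    """First pass: a cheap, broad check (hypothetical field)."""
    return record["coarse"]


def detailed_score(record: Dict) -> float:
    """Second pass: a different, more careful check (hypothetical field)."""
    return record["detailed"]


records = [
    {"id": 1, "coarse": 0.90, "detailed": 0.20},  # passes first check only
    {"id": 2, "coarse": 0.80, "detailed": 0.95},  # passes both checks
    {"id": 3, "coarse": 0.10, "detailed": 0.99},  # dropped in the first pass
]

result = multipass_filter(records, [(coarse_score, 0.5), (detailed_score, 0.9)])
surviving_ids = [r["id"] for r in result]
```

Because the second pass uses a different measure than the first, record 1 is cleared on review instead of remaining a false alarm.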
Accuracy doesn't happen by itself; rather, it's an element to build into operations in the name of quality, tempered by the precision required to achieve enterprise goals. Techniques should not be viewed as ends in themselves (data quality, for instance, is not an absolute) but more appropriately as factors in a larger, overall picture. That dynamism of the big picture dictates managed models that accommodate changing conditions without degradation in accuracy, models that may internalize checks and balances. The accuracy effort will pay off through support for automated decision-making, which is a growing enterprise imperative, like it or not.
Seth Grimes is a principal of Alta Plana Corp., a Washington, D.C.-based consultancy specializing in large-scale analytic computing systems.
Accuracy vs. Precision
The terms accuracy and precision are sometimes carelessly used interchangeably. Accuracy describes exactness, that is, how close a value is to the truth, while precision refers instead to the sensitivity of the measurement, estimate, or computation. If the odometer in my car records 101.2 miles after I've driven only 100.0, its precision is nonetheless in tenths of a mile even while it is about 1 percent inaccurate, an error an order of magnitude larger than its precision.
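The odometer example works out as follows in a short Python sketch. The variable names are mine; the three input figures are the ones given in the example.

```python
true_miles = 100.0   # distance actually driven
reading = 101.2      # what the odometer shows
resolution = 0.1     # precision: the odometer reports tenths of a mile

absolute_error = abs(reading - true_miles)       # accuracy shortfall in miles
relative_error = absolute_error / true_miles     # as a fraction of the true value
error_to_resolution = absolute_error / resolution

print(f"Absolute error: {absolute_error:.1f} miles")
print(f"Relative error: {relative_error:.1%}")
print(f"Error is about {error_to_resolution:.0f} times the resolution")
```

The display resolves tenths of a mile, yet the reading is off by more than a full mile, so the extra digit conveys precision without conveying accuracy.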
An exchange between Captain Kirk and first officer Spock in the Star Trek episode "Errand of Mercy" illustrates the confusion:
Kirk: What would you say the odds are on our getting out of here?
Spock: It is difficult to be precise, Captain. I should say approximately 7,824.7 to one.
Kirk: Difficult to be precise? 7,824 to one?
Spock: 7,824.7 to one.
Kirk: That's a pretty close approximation.
Spock: I endeavor to be accurate.
Kirk: You do quite well.
Although precise to five places, Spock's approximation was really quite inaccurate since the pair did get out of that particular scrape. Spock could perhaps have done better computing the odds from prior probabilities, using Bayesian statistics, given Captain Kirk's past stellar performance in tight spots.
I'm preparing a research study on integration of text mining with traditional techniques for analyzing numeric data. Text mining promises to unlock useful, usable knowledge that is hidden away in unstructured documents. It may prove especially powerful when linked with data mining and BI techniques. If you're doing this sort of integrated analytics, thinking about it, or just plain interested in the subject, drop me a line. I'd like to hear your thoughts.
Additional Columns at IntelligentEnterprise.com:
"The Word on Text Mining," Dec. 10, 2003:
"Futures Shock," Oct. 10, 2003:
Visit the Business Intelligence InfoCenter at www.intelligententerprise.com/info_centers/bi/
Check out Intelligent Enterprise's Playbook "Complying with Sarbanes-Oxley":
With their integrity on the line, CEOs, CFOs, and other corporate officers mean business when they demand "confidence in the numbers." But where can top executives and IT managers turn for answers?