A Question of Quality

Data quality is today's obsession. How good does the data really have to be?

Joe Celko, Contributor

August 25, 2004

Everyone ought to go to the Wilshire Meta-data Conferences — not just for the excellent sessions, but also for the quick ideas and little stories that get tossed out at you so fast that you need a week or two to digest them.

The big thing right now is DQ (data quality, not Dairy Queen, in case you've been on Mars). Being a cynic, I think its new popularity comes from Sarbanes-Oxley and more audits: management is worried about jail time.

The little toss-off I'm currently playing with came from Graeme Simsion at a panel discussion at this year's Meta-Data Conference (May 2-6, 2004). I assume you know the Zachman Framework, which also works as a general-purpose mental tool for framing a problem, not just as an IT architecture artifact. In 25 words or less: you build an array with the English interrogatives (who, what, where, when, why, and how) on one axis and the parts of your project on the other axis.

What Graeme pointed out is that other languages have different interrogatives that can go into this array. In particular, the French also ask about quantity and cost (how many/much?) and quality (how good?). Now, I have a quality dimension at the level of my metadata design that needs to be part of the data dictionary. I would also add "how trusted?" to the list. (Maybe it's an interrogative in Klingon?)
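
Just to make that array concrete, here is one way I might rough it out in SQL. The table and column names are strictly my own invention for illustration; nothing here comes from Zachman or from Graeme.

-- A toy version of the framework array: one row per (project part, question)
-- cell, with the extra interrogatives added to the usual English six.
CREATE TABLE Framework_Grid
(project_part  VARCHAR(50) NOT NULL,
 interrogative VARCHAR(15) NOT NULL
   CHECK (interrogative IN ('who', 'what', 'where', 'when', 'why', 'how',
                            'how many', 'how much', 'how good', 'how trusted')),
 answer        VARCHAR(500),
 PRIMARY KEY (project_part, interrogative));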

Return to Sender

Let me make this more concrete with the classic source of bad data — mailing addresses. When I'm setting up the data dictionary for the addresses, if I use Graeme's model as part of the high-level design process, then I have a cell for quality in the array to fill out. It has two parts: How good is the data I get? How good do I need it to be?
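
Purely as an illustration (the names are mine, not any standard), that quality cell could show up in the data dictionary as two columns: the quality I measured and the quality I need.

-- A hypothetical data dictionary entry for the address element, splitting the
-- quality cell into "how good it is" versus "how good it has to be."
CREATE TABLE Data_Dictionary
(element_name     VARCHAR(50) NOT NULL PRIMARY KEY,
 quality_measured DECIMAL(5,2),  -- e.g., 97.50 means 97.5% of addresses deliverable
 quality_required DECIMAL(5,2)); -- the level the business actually needs

INSERT INTO Data_Dictionary
VALUES ('mailing_address', 97.50, 98.00);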

Because quality is now at the top level, I have to have mechanisms all the way down to the implementation level to enforce it. Quality gets built into the system instead of being stuck on later.

If I'm doing bulk mailings, then perhaps I can live with a 2 to 3 percent bad-address rate and slower delivery. The cost of cleaning the data is greater than the return I would get from doing it. But if I'm sending out lab results to doctors, I need the addresses to be 100 percent accurate and the delivery fast so I don't kill people. In this example, a large quantity lets me get away with a lower quality.
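
The arithmetic behind that judgment is simple enough to write down. The numbers below are invented for illustration; the only point is that the comparison is wasted postage versus cleanup cost.

-- Toy break-even check with made-up numbers: clean the list only when the money
-- wasted on undeliverable pieces exceeds the cost of the cleanup itself.
SELECT CASE
         WHEN (bad_rate * list_size * cost_per_bad_piece)
              > (list_size * cleanup_cost_per_record)
         THEN 'clean the list'
         ELSE 'live with it'
       END AS decision
  FROM (VALUES (0.03, 100000, 0.40, 0.50))
       AS assumptions(bad_rate, list_size, cost_per_bad_piece, cleanup_cost_per_record);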

But some data "spoils" over time. People move, change jobs, change income, and change health. So what is the speed of the decay? Are we talking about milk or pickles in the data refrigerator?

The idea of spoilage gives me a measure of quality as it relates to time. If my mailing list is very old, I expect a lot more returns than from a list I compiled last week. If I track the spoilage so that I know roughly x percent of the addresses go bad every k weeks, then I have a way to schedule data cleanups.
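
Under the crude assumption that the spoilage rate stays roughly constant, the cleanup schedule falls out of that measurement directly. The figures below are placeholders, not real spoilage data.

-- Back-of-the-envelope schedule: if about x percent of the addresses go bad
-- every k weeks, and I can tolerate a 3 percent bad-address rate, this is how
-- long I can wait between scrubbings.
SELECT tolerance_pct / (x_pct / k_weeks) AS weeks_between_cleanups
  FROM (VALUES (1.00, 4.0, 3.00))   -- 1% spoils every 4 weeks; 3% tolerated
       AS spoilage(x_pct, k_weeks, tolerance_pct);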

The trust dimension is harder to define. The source and age of the data are obvious factors. For example, people lie about their age and their income. If I have an outside verification from a trusted source, like a birth certificate or a tax return, then I'm more confident about my data. If I'm doing a magazine survey, I can stand a little fibbing in my data and adjust for it in the aggregate. If I'm doing a security check, then there's no such thing as a "little white lie"; it's perjury and has penalties.
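
One crude way to capture that (again, every name here is invented for illustration) is to tie a trust level to the kind of verification behind the value.

-- A rough sketch of the trust dimension: trust is driven by where the value
-- came from and how it was verified.
CREATE TABLE Verification_Sources
(source_name VARCHAR(30) NOT NULL PRIMARY KEY,
 trust_level VARCHAR(10) NOT NULL
   CHECK (trust_level IN ('high', 'medium', 'low')));

INSERT INTO Verification_Sources
VALUES ('birth certificate', 'high'),
       ('tax return', 'high'),
       ('magazine survey', 'low');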

Work in Progress

I'm still playing with these ideas. My first impulse is to try to come up with some kind of statistical measurement for each data element that will look impressive because it has a lot of decimal places. Of course, this quickly leads to a DQ audit system that's more complicated than the underlying database.

What if we started with a "traffic light" code or five-point scale in the data dictionary to stand for quality? Nothing fancy, just an estimate of the data's quality. Then add a second, equally simple scale for the spoilage rate: verify once a decade, once a year, monthly, weekly, or whatever. What do you think?
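
As a strawman, here is how those two scales might sit in the data dictionary. Every name is made up, and the scales are just the ones suggested above.

-- A minimal sketch: a five-point quality rating plus a coarse verification
-- cycle, both kept in the data dictionary rather than in the data itself.
CREATE TABLE Element_Quality
(element_name   VARCHAR(50) NOT NULL PRIMARY KEY,
 quality_rating INTEGER NOT NULL
   CHECK (quality_rating BETWEEN 1 AND 5),    -- 1 = junk, 5 = gold
 verify_cycle   VARCHAR(10) NOT NULL
   CHECK (verify_cycle IN ('decade', 'year', 'month', 'week')));

INSERT INTO Element_Quality
VALUES ('mailing_address', 3, 'month');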

Joe Celko is an independent consultant in Austin, Texas, and the author of Joe Celko's Trees and Hierarchies in SQL for Smarties (Morgan Kaufmann, 2004) and Joe Celko's SQL for Smarties: Advanced SQL Programming.
