SmartAdvice: Clean And Manage Company Data To Learn What Information You've Got - InformationWeek
11:00 PM


It's worth the effort to clean data files and manage your company's data before it manages you, The Advisory Council says. Also, adopt enforceable HR policies to discourage users from downloading programs that may contain malware; and five steps to consider in launching a mentoring program.

Editor's Note: Welcome to SmartAdvice, a weekly column by The Advisory Council (TAC), an advisory service firm. The feature answers three questions of core interest to you, ranging from career advice to enterprise strategies to how to deal with vendors. Submit questions directly to

Question A: We're plagued by data quality problems such as inconsistent, incorrect, and redundant customer data (e.g., multiple records for the same customer, but with different spellings). What can we do about it?

Our advice: Murphy's law, applied to data, says that if the same data is stored in two different places, the information will become inconsistent. Traditionally, redundancy and data hygiene haven't been carefully controlled.

Modern enterprise systems address this problem by implementing data-synchronization procedures between integrated systems via publish-and-subscribe or request-and-respond interfaces or, better yet, a single instance of truth across all applications. Modern database systems incorporate data-integrity subsystems to manage data of all types and metadata in the repository, including:

  • Intra-record integrity to enforce constraints on data item values and types;

  • Referential integrity to enforce the validity of references between records; and

  • Concurrency control for multiple users.
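As a rough illustration of the first two kinds of integrity, the sketch below uses Python's built-in sqlite3 module. The table and column names are hypothetical, chosen only for the example, and concurrency control is not shown. It is one way to see the constraints at work, not a recommended schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical customer and order tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Intra-record integrity: NOT NULL and CHECK constrain item values.
    conn.execute("""
        CREATE TABLE customer (
            customer_id  INTEGER PRIMARY KEY,
            name         TEXT NOT NULL CHECK (length(trim(name)) > 0),
            credit_limit REAL NOT NULL CHECK (credit_limit >= 0)
        )""")
    # Referential integrity: every order must point at an existing customer.
    conn.execute("""
        CREATE TABLE customer_order (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
        )""")
    return conn

def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With these tables, a customer row with a blank name or a negative credit limit is rejected, and so is an order that refers to a customer ID that does not exist.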

Yet data-hygiene problems persist and are pervasive, whether because of mergers, changes in business requirements or business processes, or simply the growing volume of data. The best first step on the road to recovery is to assess the quality of the data through a systematic audit that classifies it as good (valid) or bad (invalid). Validity is a measure of the relevance of the data to the process or analysis at hand.

The next step is to prioritize the data into A, B, and C priorities:

  • A-priority data must contain close to zero defects, because an error or omission could carry a high cost of failure. For example, misspelling a customer's name could cost you the customer;

  • B-priority data is important second-priority data. For example, a misspelling in a catalog product description may be embarrassing, but it wouldn't detract from understanding the product; and

  • C-priority data is optional or noncritical data where the cost of omission and error is marginal, such as demographic data gathered only for statistical aggregation.

You must address cleansing of A-priority data. Then address B-priority data as resources allow, and as appropriate based on a cost-benefit analysis.

Data scrubbing, also called data cleansing, is the process of amending or removing data that is incorrect, incomplete, improperly formatted, or duplicated. Using a data-scrubbing tool can save a significant amount of time and can be less costly than fixing errors manually.

Begin with a small sample of 50 to 100 occurrences of the A- and B-priority data and measure it for accuracy to get an idea of the extent of any accuracy problems.
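One way to run such a spot check is sketched below. The function names and the validity rule (a non-blank name and a five-digit ZIP code) are assumptions made for the example; substitute whatever "valid" means for your own A- and B-priority fields.

```python
import random

def estimate_accuracy(records, is_valid, sample_size=100, seed=None):
    """Draw a random sample and return (share of valid records, sample size)."""
    rng = random.Random(seed)  # a fixed seed makes the audit repeatable
    size = min(sample_size, len(records))
    if size == 0:
        raise ValueError("no records to sample")
    sample = rng.sample(records, size)
    valid = sum(1 for record in sample if is_valid(record))
    return valid / size, size

def has_name_and_zip(record):
    """Hypothetical validity rule: non-blank name and a five-digit ZIP code."""
    name = record.get("name", "").strip()
    zip_code = record.get("zip", "")
    return bool(name) and len(zip_code) == 5 and zip_code.isdigit()
```

A sample this small gives only a rough estimate, but it is usually enough to tell a 2% problem from a 20% problem before you commit to a tool or a budget.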

Matching data redundantly stored in disparate databases is one of the most painful data-cleansing problems. You should first seek to consolidate any duplicate records within a single file or database. Keep a cross-reference table to relate the surviving "occurrence-of-record" to the records that previously existed. This table is used to redirect any business transaction that uses an "old" identifier to the occurrence-of-record. Also maintain an audit file with before and after images of the data to ensure you can reconstruct the original records.
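A minimal sketch of that consolidation step follows. The matching rule (lowercase the name, strip punctuation and common company suffixes) and the choice of the lowest ID as the survivor are simplifying assumptions for illustration; commercial data-scrubbing tools use far more sophisticated matching.

```python
import json
import re
from collections import defaultdict

SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}  # assumed list

def match_key(name):
    """Crude normalization so spelling variants of one customer collide."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def consolidate(records):
    """Return (survivors, cross-reference table, audit trail)."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec["name"])].append(rec)

    survivors, xref, audit = [], {}, []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r["id"])
        survivor = dict(ordered[0])  # copy, so the originals stay untouched
        for rec in ordered[1:]:
            for field, value in rec.items():
                if survivor.get(field) in (None, ""):
                    survivor[field] = value  # fill gaps from the duplicates
        for rec in ordered:
            xref[rec["id"]] = survivor["id"]  # old identifier -> occurrence-of-record
            audit.append({"before": json.dumps(rec, sort_keys=True),
                          "after": json.dumps(survivor, sort_keys=True)})
        survivors.append(survivor)
    return survivors, xref, audit

def resolve(xref, old_id):
    """Redirect a transaction that still carries an old identifier."""
    return xref.get(old_id, old_id)
```

Because the audit trail stores the full before image of every record, any merge made by mistake can be undone by restoring the original from it.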

Then de-duplicate and consolidate the records within all the other redundant files, selecting the most reliable values for propagation. Correct and synchronize data values at each source for consistency to the extent possible. Maintain your cross-reference table of related occurrences.
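Selecting "the most reliable values" needs a rule. The sketch below assumes a simple one: rank the source systems by how much you trust them and, field by field, take the first non-blank value from the most trusted source. The system names and their ranking are hypothetical.

```python
# Hypothetical trust ranking: lower number means more reliable source.
SOURCE_RANK = {"billing": 0, "crm": 1, "web_signup": 2}

def pick_reliable_values(occurrences, fields):
    """Build one consolidated record from several sources' copies of it."""
    ranked = sorted(
        occurrences,
        key=lambda occ: SOURCE_RANK.get(occ["source"], len(SOURCE_RANK)),
    )
    golden = {}
    for field in fields:
        for occ in ranked:
            value = occ["data"].get(field)
            if value not in (None, ""):
                golden[field] = value  # first non-blank value wins
                break
    return golden

def corrections_needed(occurrences, golden):
    """List the (source, field, old, new) updates that would synchronize each source."""
    fixes = []
    for occ in occurrences:
        for field, value in golden.items():
            old = occ["data"].get(field)
            if old != value:
                fixes.append((occ["source"], field, old, value))
    return fixes
```

The list of corrections can be reviewed before anything is written back, which matters when a source system cannot accept every update.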

The goal of data management is to provide the infrastructure to transform raw data into consistent, accurate, and reliable corporate information. Its foundation consists of a two-step process:

  • Data profiling -- Understanding the quality of the data you have; and

  • Data cleansing and integration -- Combining similar data from multiple sources.
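Data profiling can start very simply. The sketch below assumes records held as Python dictionaries with hashable values. For each field it reports how often the field is filled in, how many distinct values appear, and which values are most common. The field names are whatever your own data uses.

```python
from collections import Counter

def profile(records, fields):
    """Summarize fill rate and value distribution for each field."""
    total = len(records)
    report = {}
    for field in fields:
        values = [rec.get(field) for rec in records]
        filled = [v for v in values if v not in (None, "")]
        counts = Counter(filled)
        report[field] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(counts),
            "most_common": counts.most_common(3),
        }
    return report
```

A state field that shows both "NY" and "ny" among its distinct values, for example, points to an inconsistency worth cleansing before any integration is attempted.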

Begin with discovery; end with enlightenment!

-- Peter Taglia
