SmartAdvice: Clean And Manage Company Data To Learn What Information You've Got
It's worth the effort to clean data files and manage your company's data before it manages you, The Advisory Council says. Also, adopt enforceable HR policies to discourage users from downloading programs that may contain malware; and five steps to consider in launching a mentoring program.
Editor's Note: Welcome to SmartAdvice, a weekly column by The Advisory Council (TAC), an advisory service firm. The feature answers three questions of core interest to you, ranging from career advice to enterprise strategies to how to deal with vendors. Submit questions directly to email@example.com
Question A: We're plagued by data quality problems such as inconsistent, incorrect, and redundant customer data (e.g., multiple records for the same customer, but with different spellings). What can we do about it?
Our advice: Murphy's law, applied to data, says that if the same data is stored in two different places, the information will become inconsistent. Traditionally, redundancy and data hygiene aren't carefully controlled.
Modern enterprise systems address this problem by implementing data-synchronization procedures between integrated systems via publish-and-subscribe or request-and-respond interfaces or, better yet, a single instance of truth across all applications. Modern database systems incorporate data-integrity subsystems to manage data of all types and metadata in the repository, including:
Intra-record integrity to enforce constraints on data item values and types;
Referential integrity to enforce the validity of references between records; and
Concurrency control for multiple users.
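As a rough illustration of the first two kinds of integrity, the sketch below uses Python's built-in sqlite3 module. The table and column names (customer, purchase_order, credit_limit) are hypothetical, and a production database would have its own schema and constraint syntax. Concurrency control is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema, for illustration only.
conn.executescript("""
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    -- Intra-record integrity: constraints on item values and types.
    name         TEXT NOT NULL CHECK (length(trim(name)) > 0),
    credit_limit REAL NOT NULL CHECK (credit_limit >= 0)
);
CREATE TABLE purchase_order (
    order_id    INTEGER PRIMARY KEY,
    -- Referential integrity: every order must point at a real customer.
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
""")


def try_insert(sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid customer": try_insert(
        "INSERT INTO customer VALUES (?, ?, ?)", (1, "Acme Corp", 5000.0)),
    "blank name": try_insert(
        "INSERT INTO customer VALUES (?, ?, ?)", (2, "   ", 100.0)),
    "negative credit limit": try_insert(
        "INSERT INTO customer VALUES (?, ?, ?)", (3, "Globex", -1.0)),
    "order for known customer": try_insert(
        "INSERT INTO purchase_order VALUES (?, ?)", (10, 1)),
    "order for unknown customer": try_insert(
        "INSERT INTO purchase_order VALUES (?, ?)", (11, 99)),
}

for label, accepted in results.items():
    print(f"{label}: {'accepted' if accepted else 'rejected'}")
```

Constraints like these stop many bad rows at the door, but they cannot tell that two differently spelled names refer to the same customer, which is why the audit and cleansing steps are still needed.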
Yet the problems of data hygiene persist and are pervasive due to mergers, changes in business requirements or business processes, or simply the growing volume of data. The best first step on the road to recovery is to assess the quality of the data, through a systematic audit process, from the perspective of whether the data is good (valid) or bad (invalid). Validity is a measure of the relevance of the data to the process or analysis at hand.
The next step is to prioritize the data into A, B, and C priorities:
A-priority data must contain close to zero defects, because an error or omission could carry a high cost of failure. For example, misspelling a customer's name could cost you the customer;
B-priority data is important second-priority data. For example, a misspelling in a catalog product description may be embarrassing, but it wouldn't detract from understanding the product; and
C-priority data is optional or noncritical data where the cost of omission and error is marginal, such as demographic data gathered only for statistical aggregation.
You must address cleansing of A-priority data. Then address B-priority data as resources allow, and as appropriate based on a cost-benefit analysis.
Data scrubbing, also called data cleansing, is the process of amending or removing data that is incorrect, incomplete, improperly formatted, or duplicated. Using a data-scrubbing tool can save a significant amount of time and can be less costly than fixing errors manually.
Begin with a small sample of 50 to 100 occurrences of the A- and B-priority data and measure it for accuracy to get an idea of the extent of any accuracy problems.
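A sample measurement of this kind might look like the following minimal sketch. The validity rule (a non-blank name and a five-digit ZIP code) and the field names are assumptions made for illustration; substitute the rules that matter for your own A- and B-priority data.

```python
import random
import re


def is_valid_customer(record):
    """Hypothetical validity rule: non-blank name and a five-digit ZIP code."""
    name_ok = bool((record.get("name") or "").strip())
    zip_ok = re.fullmatch(r"\d{5}", record.get("zip") or "") is not None
    return name_ok and zip_ok


def estimate_error_rate(records, is_valid, sample_size=100, seed=0):
    """Check a random sample and return the fraction of invalid records."""
    if not records:
        raise ValueError("no records to sample")
    rng = random.Random(seed)  # fixed seed so the audit can be repeated
    sample = rng.sample(records, min(sample_size, len(records)))
    invalid = sum(1 for record in sample if not is_valid(record))
    return invalid / len(sample)
```

For example, `estimate_error_rate(customers, is_valid_customer, sample_size=50)` returns a value between 0 and 1, where `customers` is a list of dictionaries loaded from your own file. A sample this small gives only a rough estimate, which is the point at this stage.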
Matching data redundantly stored in disparate databases is one of the more painful data-cleansing problems. You should first seek to consolidate any duplicate records within a single file or database. Keep a cross-reference table to relate the surviving "occurrence-of-record" to the records that previously existed. This table is used to redirect any business transaction that uses an "old" identifier to the occurrence-of-record. Also maintain an audit file with before and after images of the data to ensure you can reconstruct the original records.
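The consolidation within a single file can be sketched as follows. This is a simplified illustration, not a substitute for a data-scrubbing tool: the match key only normalizes case, punctuation, and spacing, so it will not catch genuinely different spellings, and the record fields (id, name, phone) are hypothetical.

```python
import re


def match_key(name):
    """Normalize a name for matching: lowercase, no punctuation, single spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def consolidate(records):
    """Merge duplicates; return (survivors, cross-reference table, audit trail)."""
    survivors = {}  # match key -> surviving occurrence-of-record
    xref = {}       # old identifier -> identifier of the surviving record
    audit = []      # before and after images for every merge

    for rec in records:
        key = match_key(rec["name"])
        if key not in survivors:
            survivors[key] = dict(rec)  # copy, so the input is never modified
            xref[rec["id"]] = rec["id"]
            continue

        survivor = survivors[key]
        before = dict(survivor)
        # Fill blank fields on the survivor from the duplicate.
        for field, value in rec.items():
            if field != "id" and value and not survivor.get(field):
                survivor[field] = value
        xref[rec["id"]] = survivor["id"]
        audit.append({
            "merged_image": dict(rec),
            "survivor_before": before,
            "survivor_after": dict(survivor),
        })

    return list(survivors.values()), xref, audit
```

A transaction that arrives with an old identifier can then be redirected with a lookup such as `xref[old_id]`, and the audit entries hold what is needed to reconstruct the original records.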
Then de-duplicate and consolidate the records within all the other redundant files, selecting the most reliable values for propagation. Correct and synchronize data values at each source for consistency to the extent possible. Maintain your cross-reference table of related occurrences.
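One simple way to select the most reliable values is to rank the source systems and, for each field, take the first non-blank value in rank order. The sketch below assumes such a ranking exists; the source names (billing, web_form) are hypothetical, and many organizations use more refined rules, such as preferring the most recently updated value.

```python
def pick_reliable_values(candidates, source_rank):
    """Build one consolidated record from matched records in several sources.

    candidates:  list of (source_name, record) pairs for the same customer
    source_rank: source names ordered from most to least reliable
    """
    order = {name: position for position, name in enumerate(source_rank)}
    # Unranked sources sort last; sorted() keeps ties in their input order.
    ranked = sorted(candidates, key=lambda pair: order.get(pair[0], len(order)))

    consolidated = {}
    for _source, record in ranked:
        for field, value in record.items():
            if value not in (None, "") and field not in consolidated:
                consolidated[field] = value
    return consolidated


matched = [
    ("web_form", {"name": "Acme Corp", "phone": "555-0100"}),
    ("billing", {"name": "Acme Corporation", "phone": ""}),
]
best = pick_reliable_values(matched, source_rank=["billing", "web_form"])
print(best)  # {'name': 'Acme Corporation', 'phone': '555-0100'}
```

The consolidated values would then be propagated back to each source so the systems agree.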
The goal of data management is to provide the infrastructure to transform raw data into consistent, accurate, and reliable corporate information. Its foundation consists of a two-step process:
Data profiling -- Understanding the quality of the data you have; and
Data cleansing and integration -- Combining similar data from multiple sources.
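Data profiling can start very simply. The sketch below counts, for each field, how many values are missing, how many distinct values appear, and which values are most common. The field names in the example are hypothetical, and commercial profiling tools go much further.

```python
from collections import Counter


def profile(records):
    """Summarize missing, distinct, and most common values for each field."""
    fields = sorted({field for record in records for field in record})
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [value for value in values if value not in (None, "")]
        counts = Counter(present)
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(3),
        }
    return report


customers = [
    {"name": "Acme", "state": "NY"},
    {"name": "Acme", "state": ""},
    {"name": "Globex"},
]
for field, stats in profile(customers).items():
    print(field, stats)
```

A field with many missing values, or far more distinct values than expected, is a natural candidate for closer inspection before cleansing and integration begin.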