Software // Information Management
News
12/23/2004
01:38 PM
Connect Directly
RSS
E-Mail
50%
50%

A Smart Response to Bad Data

DataFlux dfPower Studio 6.2 data quality software offers profiling, analysis, matching and correction.

PROS
•Packages a comprehensive suite of tools
•Spots similar data using "smart clustering"
•Offers quick fixes and deeper solutions
•Includes an array of prebuilt content
•Supports Web services
CONS
•Poor documentation
•Confusing organization of components
•Available only as a fat client
While companies are still warming up to data quality technologies, solutions providers are delivering increasingly comprehensive and mature products. dfPower Studio 6.2 data quality software is a solid product from a solid company, offering the right mix of tools for the right situation. Developed by DataFlux, a subsidiary of SAS, Power Studio offers an integrated suite of components and capabilities, including profiling, analysis, matching, correction and monitoring.

It's a common (if debatable) axiom that what can't be measured can't be improved, and data quality is no exception. Although companies continuously take steps to improve quality, those upgrades and fixes are often ad hoc and reactive: You see a specific instance of "garbage out" floating on the surface of corporate information (in a user interface or report, for example), and you bravely wade into the database to extract the corresponding "garbage in."

Any organized attempt to improve data quality must begin with an assessment of what you already have, which can happen at multiple levels. The simplest level is profiling: a way to analyze table- and column-level data using a set of metrics that helps you learn from the data distribution. Power Studio supports profiling by providing row count, null count, maximum and minimum values, statistical metrics and pattern analysis. (See the screenshot below.)

Power Studio supports data profiling against metrics including row event, null count, maximal and minimal values, statistical metrics and pattern analysis.

Power Studio has two components for enhanced data assessment: dfPower Studio Base-Analysis and dfPower Match. The Analysis component provides a "smart clustering" process that lets users identify and address consistency in a given data set. With smart clustering, similar data elements and phrases are grouped together to identify the inconsistencies. Smart clustering, for example, would determine that "New Jersey," "Jersey," and "Jersee" are really the same.

To clean up these inconsistencies, combine use of smart clustering with dfPower Match, which helps identify duplicate or near-duplicate data using match definition and sensitivity. Match definitions are predefined algorithms that let you specify the type of match you want to find. For example, a name column in a table might match to a full name, a first name or a last name. Sensitivity specifies the required precision.

Better Than a Band-Aid

After assessing the nature and extent of quality problems, the next step is to correct them. After selecting a data source and the appropriate metrics, for example, you can execute a job that implements the process and displays the results. This approach is useful if you're looking to apply a Band-Aid to your data quality problems.

For larger, more formal quality initiatives, use the dfPower Architect component to create a job stream tapping multiple Power Studio features. (See the screenshot on the right.) dfPower Architect is similar to an extract, transform, load (ETL) tool, but with limited capabilities for data transformation and manipulation. (ETL Studio, a separate SAS product, offers more comprehensive data transformation capabilities.)

The strength of any data quality tool resides in the number of prebuilt features that support rapid application development. Power Studio scores with numerous metrics, data analysis tools and the ability to augment algorithms. It also includes a component called dfPower Verify that provides address and telephone number verification, correction and enhancement, including certified verification for addresses in the United States and Canada.

dfPower Architect lets you create a job stream using integrated features from other components.
Two other features that deserve a look are Power Studio's Quality Knowledge Base (QKB) and data monitoring. QKB is a downloadable set of files that all Power Studio components use for analysis, parsing, standardization and matching. Fresh downloads of the QKB give you access to the most current versions of these files at all times.

Data monitoring is new to Power Studio. With monitoring, you can use metrics to set up alerts or view changes in data. Monitoring lets users be proactive about quality: Now, data quality problems come looking for you, rather than the other way around. Clearly, DataFlux is on to a good thing here. Expect this feature to continue getting better in future releases.

Power Studio is now available only in a fat client/server version, although a browser version is expected next year. Power Studio's core back-end functionality is also available in a low-level set of library routines (called Blue Fusion) or as a server process on Windows or Unix. Back-end functionality also can be integrated with other products, such as SAS's ETL Studio or enterprise transactional systems from Siebel and SAP. Power Studio supports Web services through XML/SOAP.

The Upshot

Power Studio is easy to install and run. I am, however, disappointed with the confusing way the components are organized. The user manual has a poorly structured table of contents and lacks an index, so information is hard to find. These shortcomings will steepen your learning curve, taking away from a solution that's otherwise effective and easy to use.

• dfPower Studio 6.2 is available from DataFlux, 877-846-3589, www.dataflux.com. Pricing for a typical installation starts at $50,000. Minimum system requirements: Windows NT 4.0, 2000, XP, 98 or ME; Pentium II 450MHz; 64MB RAM; 200MB disk space. Recommended: Win2000/XP, Pentium 4 1.8GHz or faster, 512MB RAM and 10+GB disk space.

QUALITY COUNTS
The Data Quality Act (DQA), a U.S. law enacted two years ago, is aimed at "ensuring and maximizing the quality, objectivity, utility and integrity of information." The DQA applies only to information disseminated by federal agencies, but this mandate would work just as well for the data held in corporate and other databases. In private-sector settings unaffected by this law, there's little debate over the need for data quality. Unfortunately, despite this acceptance and the abundance of available tools, data quality is often an afterthought in corporate thinking, planning and budgeting. Wouldn't it be nice if corporate boards were required to pass similar data quality acts?

RAJAN CHANDRAS is a principal consultant with the New York offices of CSC Consulting (www.csc.com).

Comment  | 
Print  | 
More Insights
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
InformationWeek Tech Digest - July 22, 2014
Sophisticated attacks demand real-time risk management and continuous monitoring. Here's how federal agencies are meeting that challenge.
Flash Poll
Video
Slideshows
Twitter Feed
InformationWeek Radio
Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.