News

April 10, 2000


Data Cleansing Helps E-Businesses Run More Efficiently

Data mining and E-commerce have exposed problems with the management of information

By Mike Faden

Illustration by Timothy Cook
Related links:

  • Companies See Gold In Outside Data Analysis (3/20/00)

  • New Patterns In Your Data (3/20/00)
    An E-business can't run without accurate data about its customers and business partners, and that need is elevating data cleansing from an obscure, specialized technology to a core requirement for data warehousing, customer-relationship management, and Web-based commerce.

    This month, for instance, 3M Corp. plans to roll out a service that will let the manufacturer's various divisions check addresses and other information about customers. Employees can send files to be validated against a company database of trading partners using data-cleansing software from Trillium Software, a division of Harte-Hanks Data Technologies. Address information held by multiple divisions can easily get out of sync, says Wes Hillman, IT project manager at 3M, in St. Paul, Minn. "Street names get changed at the whim of a community," he says. "And people abbreviate and mistype information."

    3M divisions in Europe, the Pacific Rim, and other parts of the world are asking for the new service. "It's a case of 'build it and they will come'--and they are coming," Hillman says. The surge in interest is largely driven by E-business initiatives such as online commerce across the company's many divisions, he notes, and the accompanying need to get a single, accurate view of 3M's trading partners--no easy feat in a $15 billion company that sells 50,000 products in 200 countries.

    Many other companies are waking up to the broader uses of data cleansing, a collection of technologies designed mainly to clean up address lists used for direct-mail campaigns--and still widely used for that purpose. The software handles the complex problems of making sure addresses and other information are accurate and up to date, and checking for duplicate records, such as multiple accounts for the same customer held under slightly different spellings. Increasingly, the software is also used to help identify other relationships in a company's data and for adding information from other sources--such as demographic data or a customer's business background--during the cleansing process.
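The duplicate check described above can be sketched in a few lines of Python. This is only an illustration of the general idea (normalize, then compare approximately), not how Trillium's or any other vendor's product works. The abbreviation table, the 0.85 threshold, the function names, and the sample records are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; commercial products ship far larger ones.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "corp": "corporation",
    "inc": "incorporated",
    "co": "company",
}


def normalize(text):
    """Lowercase, drop punctuation, and expand a few common abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose name and address look like the same customer."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a = records[i]["name"] + " " + records[i]["address"]
            b = records[j]["name"] + " " + records[j]["address"]
            if similarity(a, b) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up sample data; the second name is deliberately misspelled.
records = [
    {"name": "Acme Corp.", "address": "12 Main St."},
    {"name": "ACME Corpration", "address": "12 Main Street"},
    {"name": "Bolt Supply Co.", "address": "400 Lake Ave."},
]

duplicates = find_duplicates(records)
print(duplicates)
```

Comparing every pair of records is acceptable for a small file but grows quadratically, so a list of millions of names would need some form of grouping (by postal code, for example) before comparison.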

    Those functions are proving vital in ensuring that data is accurate enough for use in the explosion of data mining and warehousing, CRM, and E-commerce applications, analysts say. "The urgency is that companies are now at the point where they're very preoccupied with studying customer data," says Philip Russom, director of data warehousing and business intelligence at Hurwitz Group. That concern was underlined by a recent InformationWeek Research survey of 300 IT executives, in which more than 80% said improving customer data quality was their top priority.

    Data warehousing has unearthed many previously hidden data-quality problems. Inconsistencies between the way customer information is held by different units may not matter until someone tries to use that data companywide--to analyze who the biggest customers are across the whole organization, for instance. "Most companies have attempted data warehousing and discovered problems as they integrate information from different business units," says Larry English, a data-quality consultant in Brentwood, Tenn. "Data that was apparently adequate for operational systems has often proved inadequate for data warehouses."

    English cites an insurance company that was shocked when it downloaded data from a claims-processing center and found that an incredible 80% of claims apparently involved broken legs. When the insurance company investigated, it found that the code used to indicate a broken leg was the default code in the system used to process claims. And since the claims processors were paid according to how fast they worked, they used the default code to make processing as fast as possible. It didn't matter that the wrong code was used while the claim was being processed, but it caused wildly inaccurate results when the insurance company attempted to use the data in a data warehouse to analyze the patterns of diagnoses for which it was paying claims.

    The emergence of E-commerce has also opened up an "entirely new source of data-quality problems," English says. Traditionally, data has been entered into a company's system by its own employees. For instance, a telephone sales representative would enter and check details on a new customer. Now, data may be entered at a Web site directly by a customer, a business partner, or, in some cases, by anyone who visits the site. They're more likely to make mistakes--and less likely to care if they do. "The Web starts to take out 'data intermediaries,'" says Guy Creese, a senior analyst at Aberdeen Group. And that can have disastrous consequences when it comes time for analysis.

    At 3M, data cleansing will make an impact both on data warehousing and new online applications. To date, the company's trading-partners database, built using Sybase Inc.'s DBMS and Sybase's client-server development tool, has been accessed and maintained directly by only a handful of employees. The service being rolled out this month will make the system available across 3M, where it will be used to check and clean up data destined for 3M's corporate data warehouse, Hillman says. Longer term, 3M is developing a Corba-Java application programming interface that intranet-based or other online applications elsewhere within the company could use to validate information on the fly. Hillman hopes to have the API completed this year.

    In the past, the use of data-cleansing technology has been limited by its high cost. Software from Trillium and other vendors is priced at $100,000 to $300,000 or more, depending on functionality and the platform the software runs on, plus maintenance fees up to 20% per year, English says. Another limiting factor: The packages can be difficult to implement, in line with their use as highly specialized tools, Creese says.

    While director of database management at Ameritech, John Hershberger oversaw a project to cleanse millions of customer names and addresses. He says the project cut the volume of mail sent by 4% and saved about $250,000 a year in the process. The implementation, based on software from Trillium and SAS Institute Inc., cost between $210,000 and $225,000, including worker hours, and took nine months. The company saw a return on its investment very quickly, but the implementation took about twice as long and was twice as expensive as originally expected--and the software also proved a heavy consumer of system resources, especially RAM, says Hershberger, who left Ameritech early this year. He says one time-consuming task was defining the business rules that enable the Trillium software to identify valid names and addresses in ethnically diverse urban areas, where many people's names don't fit typical American conventions and spellings.

    Hershberger used SAS's Enterprise Miner and Warehouse Administrator while building the warehouse to produce statistical reports on the quality of the data before cleansing. Users say this process, often called "investigation," is an important first step in data cleansing. It shows where the problems lie in the data and how extensive those problems are, helping users decide where to focus on quality improvements.
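A rough idea of what an "investigation" report contains can be given with a short Python sketch. It does not reproduce the SAS or Vality tools; it simply profiles one column for missing values and for a single value that dominates the rest, the kind of pattern behind the insurer's broken-leg default code. The function name, the 50% threshold, and the sample codes are assumptions for illustration.

```python
from collections import Counter


def profile_column(values, dominant_share=0.5):
    """Summarize one column: missing rate, distinct values, most common value."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    top_share = top_count / len(present) if present else 0.0
    return {
        "missing_rate": (total - len(present)) / total if total else 0.0,
        "distinct": len(counts),
        "top_value": top_value,
        "top_share": top_share,
        # A value that dominates the column may be a default nobody changed.
        "suspicious": top_share >= dominant_share,
    }


# Made-up claims extract; "BL01" stands in for a system default code.
diagnosis_codes = ["BL01"] * 8 + ["FX22", ""]

report = profile_column(diagnosis_codes)
for key, value in report.items():
    print(key, value)
```

A report like this only points to where problems may lie. Deciding whether a dominant value is a genuine pattern or a data-entry shortcut still takes someone who knows the business process.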

    Pegasus Systems in Dallas supplies reservations and other travel services to hotels. Its Pegasus Business Intelligence unit performs data mining, market research, and other data-analysis functions for Marriott International, Intercontinental, and other companies. Sharon Griffin, Pegasus' business intelligence manager of data warehousing services, says investigation helps her show her clients the types of data problems they have and the importance of training staff to change business practices in order to fix them. One common problem for hotels is that 85% of guest names and addresses are incorrect--hotel staffers are always under pressure to check people in quickly, and they often cut corners when entering guest information.

    Pegasus chose Vality Technology Inc.'s Integrity software to investigate and cleanse data. Pegasus selected the software after conversations with other Vality users led it to estimate that developing the same capabilities in-house would cost three to four times as much. Dave Pittman, director of IT at Pegasus, says Vality's software appeared less rigid than competitors' because it was based on recognizing patterns in the data rather than on the rule-building approach used by others. "We felt it was more flexible," he says.

    continued...page 2

    Photo of Hillman by Sal Skog
