Kimball University: Using Regular Expressions for Data Cleansing and Standardization

Are you struggling to cleanse or standardize company names, phone numbers or other data in your warehouse? Here's how to use regular expressions to improve data matching, parse complex string patterns and increase analytic accuracy.
The Final Results

This simple expression will match six different ways to type incorporated, plus differences in spacing. The ETL process would use it in a regular expression Replace function to add a standardized version, like ", Inc.", or replace it with an empty string, like the following results set:


Not only does this standardization increase the likelihood of a match, it also improves the analytics. A count of registrations by the standardized Company is much closer to reality. This result is still not perfect; you might be combining companies that are not the same. Remember, matching is all about probabilities.

Where Can You Use RegExps?

Regular expressions can be used to help standardize any text or number column. You can also use them to extract substrings, like the search string component of a complex URL. There are examples of regular expressions to standardize or parse out most common data entry problems. This one works on United States phone numbers (\d stand for any digit):

"(\d{3})([-\ )]|\.)*\(*(\d{3})([-\ )]|\.)*(\d{4})"

You can find regular expression engines in most development toolkits, such as Microsoft's .Net regular expressions library (System.Text.RegularExpressions class), the Java JDK, and in many scripting languages like PHP, Python, and PowerGREP. Many UNIX utilities have regular expression engines built in as well. Your ETL tool may even offer direct access to regular expressions as part of its toolkit. The free version of EditPad Pro is a nice tool for interactive development and testing of your RegExp patterns.

If you have big data integration problems there are more sophisticated tools to do matching, including name and address lookups, fuzzy matching algorithms, and standardization libraries. However, if you have a relatively small data cleansing or standardization problem, or need to parse out complex string patterns, regular expressions can be a big help.

Editor's Choice
Samuel Greengard, Contributing Reporter
Cynthia Harvey, Freelance Journalist, InformationWeek
Carrie Pallardy, Contributing Reporter
John Edwards, Technology Journalist & Author
Astrid Gobardhan, Data Privacy Officer, VFS Global
Sara Peters, Editor-in-Chief, InformationWeek / Network Computing