Thomas Friedman's wonderful book, The World is Flat, chronicles a revolution that most of us in IT are well aware of. Our enterprises collect and process data from around the world. We have hundreds or even thousands of suppliers, and we have millions of customers in almost every country. Our employees, with their attendant names and addresses, come from every conceivable culture. Our financial transactions are denominated in dozens of currencies. We need to know the exact time in remote cities. And above all, even though thanks to the Web we have a tight electronic connection to all of our computing assets, we are dealing with a profoundly distributed system. This, of course, is the point of Friedman's book.
Data quality is enough of a challenge in an idealized mono-cultural environment, but it is inflamed to epic proportions in a flat world. Strangely, though, the issues of international data quality are not a single coherent theme in the IT world. For the most part, IT organizations are simply reacting to specific data problems in specific locations, without an overall architecture. Is an overall architecture even possible? This article examines the many challenges surrounding international data quality and concludes with eight recommendations for addressing the problem.
Languages and Character Sets
Beyond America and Western Europe there are hundreds of languages and writing systems that cannot be rendered using a single-byte character set such as ASCII. The Unicode standard, of course, is the internationally agreed-upon standard, using multi-byte encodings such as UTF-8 and UTF-16, intended to handle all the writing systems on earth. The latest release, Unicode 5.1, encodes 100,715 characters in virtually every modern language. It is important to understand that Unicode is not a font. It is a character set. The architectural challenge for the data warehouse is to ensure that there is end-to-end support for Unicode all the way from data capture, through all forms of storage, DBMSs, ETL processes, and finally the report writers and BI tools. If any one of these stages cannot support Unicode, the final result will be corrupted and unacceptable.
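To make the end-to-end point concrete, here is a minimal Python sketch of a pipeline in which one stage cannot handle Unicode. The stage functions and the sample name are hypothetical illustrations, not part of any real ETL tool; the only assumption is that stages pass bytes to one another.

```python
# Hypothetical pipeline stages: each one receives bytes and hands bytes on.

def utf8_stage(payload: bytes) -> bytes:
    """A stage that reads and writes UTF-8, preserving every character."""
    return payload.decode("utf-8").encode("utf-8")

def ascii_stage(payload: bytes) -> bytes:
    """A stage limited to ASCII: each unsupported character becomes '?'."""
    return payload.decode("utf-8").encode("ascii", errors="replace")

def run_pipeline(text: str, stages) -> str:
    """Push the text through every stage in order and return what comes out."""
    payload = text.encode("utf-8")
    for stage in stages:
        payload = stage(payload)
    return payload.decode("utf-8")

# Sample name with a German umlaut and a Chinese character (written as escapes).
name = "J\u00fcrgen \u674e"

good = run_pipeline(name, [utf8_stage, utf8_stage, utf8_stage])
bad = run_pipeline(name, [utf8_stage, ascii_stage, utf8_stage])

print(good == name)  # the all-Unicode pipeline preserves the name
print(bad)           # the pipeline with one ASCII-only stage does not
```

A round-trip check of this kind, run with representative names from each language you serve, is one simple way to test every stage of the pipeline.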
Cultures, Names and Salutations
The handling of names is a sensitive issue, and doing it incorrectly is a sign of disrespect. Consider the following examples from different cultures:
Brazil: Mauricio do Prado Filho
Singapore: Jennifer Chan-Lee Bee Lang
USA: Frances Hayden-Kimball
Are you confident that you can parse these names? Where does the last name start? Is Frances male or female? Some years ago, my title was Director of Applications. I received a letter addressed to "Dir of Apps", which began with "Dear Dir." I didn't take that letter very seriously!
Geographies and Addresses
Addresses in different countries are notoriously difficult to parse without detailed local knowledge. Consider the following examples:
Finland: Ulvilante 8b A 11 P1 354 SF-00561 Helsinki
Korea: 35-2 Sangdaewon-dong Kangnam-ku Seoul 165-010
Again, do you have any idea how to parse these addresses?
Privacy and Information Transfer
Even if the data you collect is properly parsed and of high quality, you need to be very careful with how you store, transport, and expose that data. France's Act of 6 January 1978 on Data Processing, Files, and Individual Liberties, amended August 2004 and March 2007, states, "The collection and processing of personal data that reveals, directly or indirectly, the racial and ethnic origins, the political, philosophical, religious opinions or trade union affiliation of persons, or which concern their health or sexual life, is prohibited. (8 paragraphs of exceptions follow)." Search the term "privacy law" on Google for much more on this topic.
Compliance is another migraine headache for the data warehouse whenever revenue or profitability data is exposed through BI tools. One of the modules in Kimball University data warehouse classes is how to allocate costs in an organization in order to compute profit. Be careful! The European Union has 27 member states, each with potentially varying financial responsibility guidance.
Transaction systems normally will capture detailed financial transactions in the true original currency at the location of the transaction. Different currencies, of course, cannot be directly added. Exchange rates change every day, in some cases rapidly. Foreign currency symbols are essential in final reports, but may not be available in the fonts you use.
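As a rough sketch of one way to keep currencies from being added directly, the Python below stores each transaction's original amount and currency alongside a value converted to a single reporting currency at the rate for the transaction date. The rates are made-up illustrative numbers, and the choice of US dollars and the names RATES_TO_USD, Transaction, and record are assumptions for the example only.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

# Illustrative, made-up daily rates to a hypothetical reporting currency (USD).
RATES_TO_USD = {
    ("EUR", date(2008, 7, 8)): Decimal("1.57"),
    ("JPY", date(2008, 7, 8)): Decimal("0.0093"),
}

@dataclass(frozen=True)
class Transaction:
    tx_date: date
    local_amount: Decimal   # value in the original currency, as captured
    local_currency: str     # ISO 4217 code of the original currency
    usd_amount: Decimal     # value converted at the rate for tx_date

def record(amount: str, currency: str, tx_date: date) -> Transaction:
    """Build a transaction row that carries both local and converted values."""
    local = Decimal(amount)
    rate = RATES_TO_USD[(currency, tx_date)]
    usd = (local * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return Transaction(tx_date, local, currency, usd)

day = date(2008, 7, 8)
txs = [record("100.00", "EUR", day), record("5000", "JPY", day)]

# Only the converted values are summed; the local values are never mixed.
total_usd = sum(t.usd_amount for t in txs)
print(total_usd)
```

Decimal is used rather than float so that monetary values are not distorted by binary rounding.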
Time Zones, Calendars and Date Formats
Contrary to popular belief, there are not just 24 time zones around the world, but hundreds! The complexity comes from daylight saving time rules. For example, Indiana is split between the Eastern and Central time zones, and until 2006 part of the state observed daylight saving time and part did not. You need a list of Indiana counties to know what time it is in Kokomo! In some areas of the world, there are dozens of jurisdictions with different time-zone rules.
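The sketch below shows how such rules might be applied in Python, which reads the IANA time zone database through its standard zoneinfo module (Python 3.9 or later, with time zone data installed). The function name capture_timestamps and the sample dates are assumptions for illustration; the zone names are IANA identifiers for two Indiana regions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

def capture_timestamps(naive_local: datetime, tz_name: str) -> dict:
    """Return both the local and the universal (UTC) time stamp for an event."""
    local = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return {"local": local, "utc": local.astimezone(timezone.utc)}

noon_2008 = datetime(2008, 7, 8, 12, 0)
noon_2005 = datetime(2005, 7, 8, 12, 0)

# Same wall-clock time, two Indiana regions, and two different years.
indy_2008 = capture_timestamps(noon_2008, "America/Indiana/Indianapolis")
knox_2008 = capture_timestamps(noon_2008, "America/Indiana/Knox")
indy_2005 = capture_timestamps(noon_2005, "America/Indiana/Indianapolis")

for label, stamps in [("Indianapolis 2008", indy_2008),
                      ("Knox 2008", knox_2008),
                      ("Indianapolis 2005", indy_2005)]:
    print(label, stamps["local"].isoformat(), stamps["utc"].isoformat())
```

Noon on the same summer day maps to different universal times depending on both the region and the year, which is why the rules have to be looked up rather than assumed.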
In western countries, most of us use the Gregorian calendar, but there are several other important calendars. For example, July 8, 2008 in the Gregorian calendar is 6-6-4705 in the Chinese calendar; 6-6-2668 in the Japanese calendar; Rajab 4, 1429 in the Muslim calendar; and Tammuz 5, 5768 in the Talmudic calendar. Can your data warehouse handle these? And if a European writes "7-8-2008," is this July 8 or August 7?
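The "7-8-2008" ambiguity can be shown in a few lines of Python. This is only a sketch: the function parse_date and its day_first flag are hypothetical names, and the idea of recording the source's convention and storing an unambiguous ISO 8601 date is one common approach, not the only one.

```python
from datetime import date, datetime

def parse_date(text: str, day_first: bool) -> date:
    """Parse a D-M-Y or M-D-Y string; the caller must know the source's convention."""
    fmt = "%d-%m-%Y" if day_first else "%m-%d-%Y"
    return datetime.strptime(text, fmt).date()

raw = "7-8-2008"
as_us = parse_date(raw, day_first=False)       # read as month-day-year
as_european = parse_date(raw, day_first=True)  # read as day-month-year

# Storing the ISO 8601 form removes the ambiguity downstream.
print(as_us.isoformat(), as_european.isoformat())
```

The string itself carries no clue about which reading is right, so the convention has to be captured as metadata at the source.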
One might think that at least with simple numbers, nothing could go wrong. But in India and other parts of South Asia the number "12,12,12,123" is perfectly legitimate and corresponds to "121,212,123" in the United States. Also, in many European and South American countries, the roles of the period and the comma in designating the decimal point are reversed from the United States. You had better get that one right!
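A small Python sketch of both cases follows. The helper names parse_number and format_indian are hypothetical, the separators must be supplied by the caller because they cannot be guessed reliably from the text, and format_indian handles non-negative integers only.

```python
from decimal import Decimal

def parse_number(text: str, group_sep: str, decimal_sep: str) -> Decimal:
    """Parse a formatted number given the source's grouping and decimal separators."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

def format_indian(n: int) -> str:
    """Format a non-negative integer with Indian grouping (last three digits, then pairs)."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])

indian = parse_number("12,12,12,123", group_sep=",", decimal_sep=".")
european = parse_number("1.234,56", group_sep=".", decimal_sep=",")
american = parse_number("1,234.56", group_sep=",", decimal_sep=".")

print(indian, european, american)
print(format_indian(121212123))
```

The European and American strings differ only in which separator plays which role, yet both parse to the same value once the convention is known.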
Architectures for International Data Quality
Here, in condensed form, are my recommendations for addressing international data quality:
1. 90 percent of data quality issues can be addressed at the source, and only 10 percent further downstream. Addressing data quality at the source requires an enterprise data quality culture, executive support, financial investment in tools and training, and business process re-engineering.
2. The master data management (MDM) movement is hugely beneficial for establishing data quality. Build MDM capabilities for all your major entities including customers, employees, suppliers, and locations. Make sure that MDM creates the members of these entities upon demand, rather than cleaning up the entities downstream. Use MDM to establish master data structures for all your important entities. Make sure the deployment lets you correctly parse these entities at all stages of the DW/BI pipeline, carrying the detailed parsing all the way to the BI tools.
3. Actively manage and report data quality metrics with data quality screens, error event schemas, and audit dimensions (read my white paper, "Architecture for Data Quality in an Enterprise DW/BI System").
4. Standardize and test Unicode capability through your DW/BI pipelines.
5. Use www.timezoneconverter.com at the time of data capture to determine the actual time of day of every transaction that occurs in a remote foreign location. Store both universal time stamps and local time stamps with every transaction.
6. Choose a single universal currency (dollars, pounds, euros, etc.) and store both the local value and the universal currency value of a financial transaction in every low-level financial transaction record.
7. Don't translate dimensions in your data warehouse. Settle on a single, master language for dimensional content to drive querying, reporting and sorting. Translate final rendered reports, if desired, in place. For hand-held device reporting, be aware that most non-English translations result in longer text than English.
8. Don't even think about establishing privacy and compliance best practices. That is a job for your legal and financial executives, not for IT. You do have a CPO and a CCO (Privacy and Compliance, respectively), don't you?
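As a very rough illustration of recommendation 3, the Python sketch below runs a couple of data quality screens over incoming rows and records each failure as an error event. The class names, field names, and the two sample screens are hypothetical simplifications chosen for the example; they are not the schema described in the white paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class ErrorEvent:
    screen_name: str       # which screen failed
    row_id: int            # which source row failed it
    detected_at: datetime  # when the failure was detected (UTC)

@dataclass
class Screen:
    name: str
    passes: Callable[[dict], bool]  # returns True when the row is acceptable

def run_screens(rows: list, screens: list) -> list:
    """Apply every screen to every row and collect one event per failure."""
    events = []
    for row in rows:
        for screen in screens:
            if not screen.passes(row):
                events.append(
                    ErrorEvent(screen.name, row["id"], datetime.now(timezone.utc))
                )
    return events

# Hypothetical screens and sample rows.
screens = [
    Screen("currency_code_present", lambda r: bool(r.get("currency"))),
    Screen("country_code_is_two_letters", lambda r: len(r.get("country", "")) == 2),
]
rows = [
    {"id": 1, "currency": "EUR", "country": "FI"},
    {"id": 2, "currency": "", "country": "KOR"},
]

events = run_screens(rows, screens)
for event in events:
    print(event.screen_name, event.row_id)
```

Counting these events by screen and by load gives the kind of data quality metric that can be reported and tracked over time.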
For more information, please consult these additional references, which detail the approaches I use for addressing international data quality. The best reference for understanding international data representation issues is Merriam-Webster's Guide to International Business Communications, Second Edition, by Toby Atkinson. I have written two relevant white papers: Architecture for Data Quality in an Enterprise DW/BI System and Architecture for Integration in an Enterprise DW/BI System. These white papers are sponsored by Informatica but are free from vendor product recommendations.