Enterprises aspire to derive insights from data that can provide a competitive advantage. The most common impediment to achieving this goal is poor data quality. If the data fed to a predictive algorithm is “dirty” (with missing or invalid values), then any insights produced by that algorithm cannot be trusted.
To achieve data quality, it’s not enough to clean up the existing historical data. You also need to ensure that all newly generated data is clean by instituting a set of capabilities and processes known collectively as data governance. In a governed data environment, each type of data has a data steward who is responsible for defining and enforcing criteria for data cleanliness. Each data value also has a clearly defined lineage: we know where it came from, what transformations it underwent along the way, and what other data items are derived from it.
Data lineage provides an enterprise with many benefits:
It sounds great, but where does data lineage information come from? Looking at a specific data value in the database tells us its current value, but not how the data evolved into that value. What is missing is data about the data (lineage metadata) that automatically records the time and source of every change made to every data item, whether the change was made by software or by a human database administrator.
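To make the idea of lineage metadata concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular product stores lineage: the class names, field names, and the sample item and source names are all assumptions made for this example. It shows a log that records the time and source of each change to a data item, so the history of a value can be replayed later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One change to one data item (all field names are illustrative)."""
    item: str             # hypothetical identifier, e.g. "orders.total"
    new_value: object     # the value after the change
    source: str           # the job or administrator that made the change
    changed_at: datetime  # when the change was recorded (UTC)


@dataclass
class LineageLog:
    """Append-only log of changes, kept alongside the data itself."""
    events: list = field(default_factory=list)

    def record(self, item, new_value, source):
        event = LineageEvent(item, new_value, source, datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, item):
        """Return every recorded change to `item`, oldest first."""
        return [e for e in self.events if e.item == item]


log = LineageLog()
log.record("orders.total", 100, source="etl_nightly_load")  # change made by software
log.record("orders.total", 120, source="dba:manual_fix")    # change made by a human
sources = [e.source for e in log.history("orders.total")]
```

The database itself would show only the current value, 120. The log adds the missing context: which process or person produced each value, and when.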
There are three competing techniques for collecting lineage metadata, each of which has its strengths and weaknesses:
1. Decoded lineage
Rather than examining data values or schemas to look for similarities, this approach focuses exclusively on the code that manipulates the data. Tools in this category (MANTA, Octopai, Spline) scan and reverse engineer all the logic to determine how data changes and which data serves as input for calculating other data. This approach provides the most accurate, complete and detailed lineage metadata, as every single piece of logic is processed. But it has some weaknesses:
2. Data similarity lineage
This approach builds lineage information by examining data and schemas without accessing your code. Tools in this category (Tamr, Paxata, Trifacta) profile data in your tables and read database metadata about tables, columns, etc., then use all that information to create lineage based on similarities. This approach will always work regardless of your coding technology, because it analyzes the resulting data rather than the technology that generated it. But it has several glaring weaknesses:
3. Manual lineage mapping
This approach builds lineage metadata by mapping and documenting the business knowledge in people’s heads (for example, by talking to application owners, data stewards and data integration specialists). The advantage of this approach is that it provides prescriptive data lineage (how data should flow, as opposed to how it actually flows after implementation bugs). But because the metadata is based on human knowledge, it is likely to be contradictory (because two people disagree about the desired data flow) or partial (if you do not know about the existence of a data set, you will not ask anyone about it).
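The difference between the first two techniques can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how MANTA, Tamr, or any other tool mentioned above works: the regular expressions handle only a simple INSERT ... SELECT statement, Jaccard overlap is just one possible similarity measure, and the table names and values are invented for the example.

```python
import re

# Assumption: statements look like "INSERT INTO target SELECT ... FROM a JOIN b ...".
# Real code scanners need full parsers for every language they support.
INSERT_RE = re.compile(r"insert\s+into\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+(\w+)", re.IGNORECASE)


def decoded_lineage(sql):
    """Read the code: return (source_table, target_table) edges for one statement."""
    target = INSERT_RE.search(sql)
    if target is None:
        return set()
    return {(src, target.group(1)) for src in SOURCE_RE.findall(sql)}


def similarity(values_a, values_b):
    """Read the data: Jaccard overlap between the distinct values of two columns."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


sql = (
    "INSERT INTO daily_sales "
    "SELECT o.day, SUM(o.total) FROM orders o "
    "JOIN stores s ON o.store_id = s.id GROUP BY o.day"
)
edges = decoded_lineage(sql)  # lineage read directly from the logic

# Two columns that share most of their values *might* be related,
# but the score alone cannot say which one was derived from the other.
score = similarity([1, 2, 3, 4], [2, 3, 4, 5])
```

The decoded function yields directed edges because the code states which table feeds which. The similarity function yields only a score, a hint that two columns may be connected, without any direction or transformation logic.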
As you can see, there is no magic bullet: each approach has its strengths and weaknesses. The best solution combines all three approaches.
Once you successfully combine these techniques, you can collect the comprehensive lineage metadata you need to start enjoying the benefits of governed data.
Moshe Kranc is the chief technology officer at Ness Digital Engineering. Kranc has extensive experience in leading adoption of bleeding-edge technologies, having worked for large companies as well as entrepreneurial start-ups. He previously headed the Big Data Center of Excellence at Barclays’ Israel Development Centre (IDEC). He has worked in the high-tech industry for over 30 years in the U.S. and Israel. He was part of the Emmy award-winning team that designed the scrambling system for DIRECTV, and he holds 6 patents in areas related to pay television, computer security and text mining.