Enterprises aspire to derive insights from data that can provide a competitive advantage. The most common impediment to achieving this goal is poor data quality. If the data that is being input to a predictive algorithm is “dirty” (with missing or invalid values), then any insights produced by that algorithm cannot be trusted.
To achieve data quality, it’s not enough to clean up the existing historical data. You also need to ensure that all newly generated data is clean by instituting a set of capabilities and processes known collectively as data governance. In a governed data environment, each type of data has a data steward who is responsible for defining and enforcing criteria for data cleanliness. In addition, each data value has a clearly defined lineage: we know where it came from, what transformations it underwent along the way, and what other data items are derived from it.
Data lineage provides an enterprise with many benefits:
- The ability to perform root-cause analysis and impact analysis from a given data item, by tracing lineage backwards (to find all data that influenced it) or forwards (to identify all other data that it impacts);
- Standardization of the business vocabulary and terminology, which facilitates clear communication across business units;
- Ownership, responsibility and traceability for any changes made to data, thanks to the lineage’s comprehensive record of who made what changes and when.
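The backward and forward tracing described in the first benefit can be pictured as a walk over a graph of lineage edges. The following is a minimal sketch, not a description of how any particular tool works; the table and column names in `EDGES` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source item, item derived from it).
EDGES = [
    ("orders.amount", "daily_revenue.total"),
    ("orders.currency", "daily_revenue.total"),
    ("fx_rates.rate", "daily_revenue.total"),
    ("daily_revenue.total", "quarterly_report.revenue"),
]


def build_index(edges):
    """Index the edges in both directions so either trace is a simple walk."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def trace(start, neighbors):
    """Breadth-first walk returning every item reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        item = queue.popleft()
        for nxt in neighbors.get(item, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_index(EDGES)

# Root-cause analysis: everything that influenced the reported revenue.
root_causes = trace("quarterly_report.revenue", upstream)

# Impact analysis: everything affected by a change to the order amount.
impacted = trace("orders.amount", downstream)
```

The same edge list serves both analyses; only the direction of the walk changes.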
It sounds great, but where does data lineage information come from? Looking at a specific data value in the database tells us its current value, but not how the data evolved into that value. What is missing is data about the data (lineage metadata) that automatically records the time and source of every change made to every data item, whether the change was made by software or by a human database administrator.
There are three competing techniques for collecting lineage metadata, each of which has its strengths and weaknesses:
1. Decoded lineage
Rather than examining data values or schemas to look for similarities, this approach focuses exclusively on the code that manipulates the data. Tools in this category (MANTA, Octopai, Spline) scan all the logic and reverse engineer it to build an understanding of how data changes and which data serves as input for calculating other data. This approach provides the most accurate, complete and detailed lineage metadata, as every single piece of logic is processed. But it has some weaknesses:
- It may not be easy to develop enough support for the dozens of languages that must be analyzed to cover the basics of your environment. It may also prevent you from adopting a new technology because your decoded lineage engine does not yet support it.
- Code versions change over time, so your analysis of the current code’s data flow may miss an important flow that has since been superseded.
- When the code is dynamic (you build your expressions on the fly based on program inputs, data in tables, environment variables, etc.), you need a way to decode the dynamic code.
- Not all data changes are generated by code. For example, suppose there is an emergency outage on your web site, which your DBA repairs manually by executing a sequence of SQL commands directly on your production database. These changes will never be detected by decoded lineage tools, because they were generated by a DBA rather than by code.
- The code may be doing the wrong thing to the data. For example, suppose your code stores personal identification information in violation of GDPR and despite clear requirements to the contrary from the product manager. A decoded lineage tool will faithfully capture what the code does without raising a red flag.
- Suppose two pieces of code in two separate processes are performing the same calculations to create the same duplicate data in the database. Code analysis cannot discover this situation, because each piece of code is behaving properly. Only by examining the database can the duplication be discovered and eliminated.
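To make the idea of decoding lineage from code concrete, here is a toy sketch that extracts source-to-target edges from a single SQL statement. It is an illustration under strong assumptions: the regular expressions only handle a plain `INSERT INTO ... SELECT ... FROM ... JOIN ...` statement, whereas real tools parse the full grammar of each supported language. The table names are hypothetical.

```python
import re

# Toy patterns for one statement shape; not a real SQL parser.
TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def decode_lineage(sql):
    """Return (source_table, target_table) edges for one INSERT ... SELECT."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []
    sources = SOURCE_RE.findall(sql)
    # dict.fromkeys removes duplicate sources while keeping their order.
    return [(source, target.group(1)) for source in dict.fromkeys(sources)]


STATEMENT = """
INSERT INTO daily_revenue (day, total)
SELECT o.day, SUM(o.amount * r.rate)
FROM orders o
JOIN fx_rates r ON r.currency = o.currency
GROUP BY o.day
"""

edges = decode_lineage(STATEMENT)
# edges == [("orders", "daily_revenue"), ("fx_rates", "daily_revenue")]
```

The sketch also shows why the weaknesses above arise: a statement assembled at run time never appears in the source as a string to scan, and SQL typed by hand into a production console is never seen at all.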
2. Data similarity lineage
This approach builds lineage information by examining data and schemas without accessing your code. Tools in this category (Tamr, Paxata, Trifacta) profile data in your tables and read database metadata about tables, columns, etc., then use all that information to create lineage based on similarities. The strength of this approach is that it always works regardless of your coding technology, because it analyzes the resulting data no matter which technology generated it. But it has several glaring weaknesses:
- It can take a lot of time and processing power to detect data similarities across a large database.
- The resulting metadata will be missing a lot of details, e.g., transformation logic.
- It cannot detect lineage for processes that have not yet been executed. For example, suppose you have an end-of-year accounting process that adjusts revenues and inventory. Until that process runs on December 31, you will have no lineage metadata available about it.
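One simple way to picture similarity-based discovery is to compare the sets of values found in each pair of columns. The sketch below uses Jaccard overlap for that comparison. This is an assumed technique chosen for illustration, not a description of how the tools named above work, and the column names, sample values and 0.5 threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical value samples keyed by "table.column".
COLUMNS = {
    "crm.customer_id": [101, 102, 103, 104],
    "billing.cust_no": [102, 103, 104, 105],
    "billing.invoice_total": [19.99, 5.00, 42.50, 7.25],
}


def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_columns(columns, threshold=0.5):
    """Return (column, column, score) for each pair at or above threshold."""
    pairs = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(columns.items(), 2):
        score = jaccard(vals_a, vals_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


candidates = similar_columns(COLUMNS)
# One candidate link: crm.customer_id and billing.cust_no share 3 of 5 values.
```

Every pair of columns is compared, so the work grows quadratically with the number of columns, which is one reason this approach can be expensive on a large database. The result also says only that two columns look related, not which transformation connects them.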
3. Manual lineage mapping
This approach builds lineage metadata by mapping and documenting the business knowledge in people’s heads (for example, by talking to application owners, data stewards and data integration specialists). The advantage of this approach is that it provides prescriptive data lineage (how data should flow, as opposed to how it actually flows after implementation bugs). But because the metadata is based on human knowledge, it is likely to be contradictory (because two people disagree about the desired data flow) or partial (if you do not know about the existence of a data set, you will not ask anyone about it).
As you can see, there is no magic bullet -- each approach has its strengths and weaknesses. The best solution combines all three approaches.
- Start with decoded lineage, using a tool such as MANTA, Octopai, or Spline.
- Augment with data similarity lineage, using tools such as Tamr, Paxata, or Trifacta, to discover patterns in the database.
- Augment with manual lineage mapping, to capture prescriptive lineage rules (for example, how the data flows were supposed to be implemented).
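As a rough illustration of what combining the three techniques might look like, the sketch below merges the edges each technique reports and keeps track of which technique reported them. The function names, the edge sets and the two gap categories are assumptions made for this example, not features of any tool mentioned here.

```python
from collections import defaultdict


def combine(decoded, similarity, manual):
    """Merge edge sets, recording which technique reported each edge."""
    evidence = defaultdict(set)
    reports = (("decoded", decoded), ("similarity", similarity), ("manual", manual))
    for technique, edges in reports:
        for edge in edges:
            evidence[edge].add(technique)
    return dict(evidence)


def gaps(evidence):
    """Find flows nobody prescribed and prescribed flows nobody observed."""
    undocumented = sorted(e for e, seen in evidence.items() if "manual" not in seen)
    unimplemented = sorted(e for e, seen in evidence.items() if seen == {"manual"})
    return undocumented, unimplemented


# Hypothetical (source, target) edges reported by each technique.
decoded = {("orders", "daily_revenue"), ("customers", "marketing_export")}
similarity = {("orders", "daily_revenue")}
manual = {("orders", "daily_revenue"), ("fx_rates", "daily_revenue")}

evidence = combine(decoded, similarity, manual)
undocumented, unimplemented = gaps(evidence)
# undocumented: found in code or data, but nobody said it should exist.
# unimplemented: prescribed by people, but not observed in code or data.
```

An edge confirmed by all three techniques deserves the most trust, while the two gap lists point at places where the implementation and the intended data flow disagree.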
Once you successfully combine these techniques, you can collect the comprehensive lineage metadata you need to start enjoying the benefits of governed data.
Moshe Kranc is the chief technology officer at Ness Digital Engineering. Kranc has extensive experience in leading adoption of bleeding-edge technologies, having worked for large companies as well as entrepreneurial start-ups. He previously headed the Big Data Center of Excellence at Barclays’ Israel Development Centre (IDEC). He has worked in the high-tech industry for over 30 years in the U.S. and Israel. He was part of the Emmy award-winning team that designed the scrambling system for DIRECTV, and he holds 6 patents in areas related to pay television, computer security and text mining.