Add Derived Data To Your DBMS Strategy
Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.
When we think of data warehouses, we usually think of them as storing raw data, and lots of it. But there's no escaping the importance of derived, augmented, enhanced, adjusted or cooked data in analytic data processing.
Five areas I have in mind are, loosely:
- Aggregates, which are commonly maintained in precalculated form for reasons of performance or response time, especially by users of older analytic database management technologies.
- Calculated scores, commonly based on data mining/predictive analytics. These are hugely important in marketing applications and in their anti-fraud cousins.
- Text analytics, most notably when trying to extract market insight from various kinds of internet and social media pages, posts, and communications, but also in problem-detection applications ranging from anti-terrorism to warranty analysis.
- Various kinds of ETL (Extract/Transform/Load) output that Hadoop and other MapReduce frameworks are commonly used to produce, such as processed log files and, again, text analytics.
- Adjusted data, such as information adjusted for weather or seasonality. This comes up especially, but not only, in scientific contexts.
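To make the last category concrete, here is a minimal sketch of seasonal adjustment. The method (a simple ratio-to-average deseasonalization) and the idea of a fixed period are illustrative assumptions, not a recommendation for any particular adjustment technique:

```python
# Minimal seasonal-adjustment sketch (ratio-to-average method).
# The choice of method and the period length are illustrative assumptions.

def seasonal_factors(values, period=12):
    """Mean of each position-in-period, normalized so the factors average to 1."""
    buckets = [[] for _ in range(period)]
    for i, v in enumerate(values):
        buckets[i % period].append(v)
    means = [sum(b) / len(b) for b in buckets]
    overall = sum(means) / period
    return [m / overall for m in means]

def adjust(values, period=12):
    """Divide each observation by its period's factor to remove seasonality."""
    factors = seasonal_factors(values, period)
    return [v / factors[i % period] for i, v in enumerate(values)]
```

Whether you store the adjusted series or recompute it per query is exactly the derived-data decision the rest of this article is about.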
Probably there are yet more examples that I am overlooking. But even these should suffice to establish my point, as should the broad list of synonyms for "derived data" I used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time and the like.
This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic database management system (DBMS) and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using.
In Oracle, especially pre-Exadata, you're apt to have a lot of materialized views. In Netezza, not so much. If you're using a MOLAP (Multidimensional OnLine Analytic Processing) tool such as Essbase, you're probably going crazy with pre-calculated roll-ups. And if you're using Skytide, you may not be keeping the unaggregated raw data at all.
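The trade-off behind a materialized view can be shown in miniature: maintain a small precomputed roll-up incrementally as facts arrive, so queries hit the roll-up instead of scanning the fact table. The table layout and figures below are illustrative assumptions:

```python
# Sketch of a precomputed roll-up (a poor man's materialized view).
# Fact rows and column names are illustrative assumptions.
from collections import defaultdict

sales = [
    ("east", "2024-01", 100),
    ("east", "2024-02", 150),
    ("west", "2024-01", 200),
    ("west", "2024-02", 250),
]

# "Materialized" aggregate: total sales by region, kept current incrementally.
rollup = defaultdict(int)
for region, month, amount in sales:
    rollup[region] += amount

def insert(row):
    """New fact rows update the roll-up as they arrive."""
    sales.append(row)
    rollup[row[0]] += row[2]

insert(("east", "2024-03", 120))
# A query now reads the small roll-up rather than rescanning the facts.
print(rollup["east"])  # 370
```

A fast columnar DBMS makes the recompute-every-time branch cheap enough that the roll-up may not be worth maintaining, which is exactly the Oracle-versus-Netezza contrast above.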
Something similar could be said about the simpler forms of data mining scoring; if you're just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that's about as far as it goes.
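A weighted-sum score of the kind just described is cheap enough that storing it really is optional. The feature names and weights below are illustrative assumptions, not a real scoring model:

```python
# Sketch of a weighted-sum model score, computed on demand or precomputed.
# Weights and features are illustrative assumptions.

WEIGHTS = {"recency": 0.5, "frequency": 0.3, "monetary": 0.2}

def score(customer):
    """Linear score; cheap enough that materializing it is a nice-to-have."""
    return sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)

c = {"recency": 0.8, "frequency": 0.6, "monetary": 0.4}

# Option A: compute at query time.
on_demand = score(c)

# Option B: precompute once and store alongside the raw row.
c["score"] = score(c)
```

With a fast enough DBMS, Option A suffices; more elaborate models (ensembles, text scoring) push you back toward Option B.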