Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.
When we think of data warehouses, we usually think of them as storing raw data, and lots of it. But there's no escaping the importance of derived, augmented, enhanced, adjusted or cooked data in analytic data processing.
Five areas I have in mind are, loosely:
Aggregates, which are commonly maintained in precalculated form for reasons of performance or response time, especially by users of older analytic database management technologies.
Calculated scores, commonly based on data mining/predictive analytics. These are hugely important in marketing applications and in their anti-fraud cousins.
Text analytics, most notably when trying to extract market insight from various kinds of internet and social media pages, posts, and communications, but also in problem-detection applications ranging from anti-terrorism to warranty analysis.
Various kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for, such as log-file processing and, again, text analytics.
Adjusted data, such as information adjusted for weather or seasonality. This comes up especially, but not only, in scientific contexts.
Probably there are yet more examples that I am overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of "derived data" I've used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time and the like.
This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic database management system (DBMS) and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using.
In Oracle, especially pre-Exadata, you're apt to have a lot of materialized views. In Netezza, not so much. If you're using a MOLAP (Multidimensional OnLine Analytic Processing), tool such as Essbase, you're probably going crazy with pre-calculated roll-ups. And if you're using Skytide, you may not be keeping the unaggregated raw data at all.
Something similar could be said about the simpler forms of data mining scoring; if you're just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that's about as far as it goes.
How Enterprises Are Attacking the IT Security EnterpriseTo learn more about what organizations are doing to tackle attacks and threats we surveyed a group of 300 IT and infosec professionals to find out what their biggest IT security challenges are and what they're doing to defend against today's threats. Download the report to see what they're saying.
2017 State of IT ReportIn today's technology-driven world, "innovation" has become a basic expectation. IT leaders are tasked with making technical magic, improving customer experience, and boosting the bottom line -- yet often without any increase to the IT budget. How are organizations striking the balance between new initiatives and cost control? Download our report to learn about the biggest challenges and how savvy IT executives are overcoming them.