Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.
When we think of data warehouses, we usually think of them as storing raw data, and lots of it. But there's no escaping the importance of derived, augmented, enhanced, adjusted or cooked data in analytic data processing.
Five areas I have in mind are, loosely:
Aggregates, which are commonly maintained in precalculated form for reasons of performance or response time, especially by users of older analytic database management technologies.
Calculated scores, commonly based on data mining/predictive analytics. These are hugely important in marketing applications and in their anti-fraud cousins.
Text analytics, most notably when trying to extract market insight from various kinds of internet and social media pages, posts, and communications, but also in problem-detection applications ranging from anti-terrorism to warranty analysis.
Various kinds of ETL (Extract/Transform/Load) that Hadoop and other forms of MapReduce are commonly used for, such as log-file processing and, again, text analytics.
Adjusted data, such as information adjusted for weather or seasonality. This comes up especially, but not only, in scientific contexts.
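To make the ETL/MapReduce category concrete, here is a minimal sketch of deriving summary data from raw log lines in map/reduce style. The log format and status-code counting are illustrative assumptions, not taken from any particular Hadoop job.

```python
from collections import Counter
from functools import reduce

# Hypothetical raw log lines; in practice these would stream from files or HDFS.
raw_logs = [
    "2011-03-01 GET /home 200",
    "2011-03-01 GET /missing 404",
    "2011-03-02 POST /login 200",
    "2011-03-02 GET /home 200",
]

def map_phase(line):
    """Map step: emit a partial count keyed by HTTP status code."""
    status = line.split()[-1]
    return Counter({status: 1})

def reduce_phase(a, b):
    """Reduce step: merge partial counts."""
    return a + b

# The derived data -- status-code totals -- replaces or supplements the raw logs.
derived = reduce(reduce_phase, map(map_phase, raw_logs))
print(dict(derived))  # {'200': 3, '404': 1}
```

The point of storing `derived` rather than re-running the job is exactly the performance trade-off discussed below.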
Probably there are yet more examples that I am overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of "derived data" I've used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time and the like.
This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic database management system (DBMS) and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using.
In Oracle, especially pre-Exadata, you're apt to have a lot of materialized views. In Netezza, not so much. If you're using a MOLAP (Multidimensional OnLine Analytic Processing) tool such as Essbase, you're probably going crazy with pre-calculated roll-ups. And if you're using Skytide, you may not be keeping the unaggregated raw data at all.
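The materialized-view idea can be sketched in a few lines: keep a derived roll-up in sync as raw rows load, so queries never scan the raw data. The table shape and region names here are invented for illustration.

```python
# Raw fact rows and a derived aggregate maintained alongside them --
# an in-application analogue of a materialized view or MOLAP roll-up.
sales = []             # raw rows: (region, amount)
totals_by_region = {}  # precomputed aggregate, updated on load

def load_row(region, amount):
    sales.append((region, amount))
    totals_by_region[region] = totals_by_region.get(region, 0.0) + amount

for region, amount in [("east", 100.0), ("west", 50.0), ("east", 25.0)]:
    load_row(region, amount)

# A query reads the roll-up directly instead of summing the raw rows.
print(totals_by_region["east"])  # 125.0
```

The cost, as with any materialized view, is extra work and storage at load time in exchange for cheap reads.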
Something similar could be said about the simpler forms of data mining scoring; if you're just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that's about as far as it goes.
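For the weighted-sum case just mentioned, the calculation is cheap enough that whether to precompute it really is just a performance question. A minimal sketch, with purely illustrative feature names and weights:

```python
# Simplest form of data mining scoring: a weighted sum over features.
# Weights are assumed for illustration, not from any real model.
weights = {"recency": 0.5, "frequency": 0.3, "monetary": 0.2}

def score(customer):
    """Weighted sum of the customer's feature values."""
    return sum(weights[f] * customer[f] for f in weights)

customer = {"recency": 0.8, "frequency": 0.4, "monetary": 0.9}
print(round(score(customer), 2))  # 0.7
```

A fast columnar DBMS can evaluate such an expression at query time over millions of rows; a slower stack makes storing the precomputed score column more attractive.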