When we think of data warehouses, we usually think of them as storing raw data, and lots of it. But there's no escaping the importance of derived, augmented, enhanced, adjusted or cooked data in analytic data processing.
Five areas I have in mind are, loosely:
Aggregates, which are commonly maintained in precalculated form for reasons of performance or response time, especially by users of older analytic database management technologies.
Calculated scores, commonly based on data mining/predictive analytics. These are hugely important in marketing applications and in their anti-fraud cousins.
Text analytics, most notably when trying to extract market insight from various kinds of internet and social media pages, posts, and communications, but also in problem-detection applications ranging from anti-terrorism to warranty analysis.
Various kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for, such as log-file processing and, again, text analytics.
Adjusted data, such as information adjusted for weather or seasonality. This comes up especially, but not only, in scientific contexts.
Probably there are yet more examples that I am overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of "derived data" I've used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time and the like.
This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic database management system (DBMS) and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using.
In Oracle, especially pre-Exadata, you're apt to have a lot of materialized views. In Netezza, not so much. If you're using a MOLAP (Multidimensional OnLine Analytic Processing), tool such as Essbase, you're probably going crazy with pre-calculated roll-ups. And if you're using Skytide, you may not be keeping the unaggregated raw data at all.
Something similar could be said about the simpler forms of data mining scoring; if you're just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that's about as far as it goes.
For more complex kinds of predictive analytic models, real-time scoring could be prohibitively slow. Ditto for social graph analysis, and the same goes for the other examples as well.
Text analytics requires a lot of processing per document. You need to tokenize (among other things, identify the boundaries of) the words, sentences and paragraphs; identify the words' meaning; map out the grammar; resolve references such as pronouns; and often do more besides (e.g. sentiment analysis).
There are a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you do a query. (Not coincidentally, MarkLogic -- which does a huge fraction of its business in text-oriented uses -- thinks heavily in terms of the enhancement and augmentation of data.)
If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables as well. (See, for example, the use cases in two recent Aster Data slide decks.) Generally, those are not processes that you want to try to run real time.
Scientists have a massive need to adjust or "cook" data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:
Log processing, not unlike that done in various commercial sectors.
Assigning data to different kinds or densities of coordinate grids -- "regridding" -- often through a process of interpolation/approximation/estimation.
Adjusting/normalizing data for all kinds of effects (such as weather cycles).
Examples where data adjustment is needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
Similar issues arise in marketing applications, even beyond the kind of straightforward, predictive-analytics-based scoring and psychographic/clustering results one might expect.
For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when, in fact, you're a 32-year-old male with 400 Facebook friends who are mostly in your age range. Let's say you tend to look at Web sites about cars, poker, and video games and have a propensity to click on ads featuring scantily-clad females.
Increasingly, analytic systems presented with this scenario would be smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
Curt Monash runs Monash Research, which provides strategic advice to users and vendors of advanced information technology. He also writes the blogs DBMS 2, Text Technologies, and Strategic Messaging. Write him at [email protected]