Software // Information Management
Commentary
12/14/2010
09:22 AM
Curt Monash
Curt Monash
Commentary
50%
50%

Add Derived Data To Your DBMS Strategy

Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.

When we think of data warehouses, we usually think of them as storing raw data, and lots of it. But there's no escaping the importance of derived, augmented, enhanced, adjusted or cooked data in analytic data processing.

Five areas I have in mind are, loosely:

Aggregates, which are commonly maintained in precalculated form for reasons of performance or response time, especially by users of older analytic database management technologies.

Calculated scores, commonly based on data mining/predictive analytics. These are hugely important in marketing applications and in their anti-fraud cousins.

Text analytics, most notably when trying to extract market insight from various kinds of internet and social media pages, posts, and communications, but also in problem-detection applications ranging from anti-terrorism to warranty analysis.

Various kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for, such as log-file processing and, again, text analytics.

Adjusted data, such as information adjusted for weather or seasonality. This comes up especially, but not only, in scientific contexts.

Probably there are yet more examples that I am overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of "derived data" I've used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:

Do we have to plan for data other than what we will be storing in raw form?

Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time and the like.

This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic database management system (DBMS) and the related technology stack. In practice, however, that often turns out not to be the case.

Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using.

In Oracle, especially pre-Exadata, you're apt to have a lot of materialized views. In Netezza, not so much. If you're using a MOLAP (Multidimensional OnLine Analytic Processing), tool such as Essbase, you're probably going crazy with pre-calculated roll-ups. And if you're using Skytide, you may not be keeping the unaggregated raw data at all.

Something similar could be said about the simpler forms of data mining scoring; if you're just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that's about as far as it goes.

For more complex kinds of predictive analytic models, real-time scoring could be prohibitively slow. Ditto for social graph analysis, and the same goes for the other examples as well.

Previous
1 of 2
Next
Comment  | 
Print  | 
More Insights
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
InformationWeek Tech Digest, Dec. 9, 2014
Apps will make or break the tablet as a work device, but don't shortchange critical factors related to hardware, security, peripherals, and integration.
Video
Slideshows
Twitter Feed
InformationWeek Radio
Archived InformationWeek Radio
Join us for a roundup of the top stories on InformationWeek.com for the week of December 14, 2014. Be here for the show and for the incredible Friday Afternoon Conversation that runs beside the program.
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.