Add Derived Data To Your DBMS Strategy
Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.
December 14, 2010
Text analytics requires a lot of processing per document. You need to tokenize the text (among other things, identify the boundaries of words, sentences, and paragraphs); identify the words' meanings; map out the grammar; resolve references such as pronouns; and often do more besides (e.g., sentiment analysis).
There are a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you do a query. (Not coincidentally, MarkLogic -- which does a huge fraction of its business in text-oriented uses -- thinks heavily in terms of the enhancement and augmentation of data.)
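The enrich-once, query-many pattern can be sketched in a few lines. This is a deliberately naive illustration, not any vendor's pipeline: the regexes, the tiny sentiment lexicon, and the in-memory `derived_store` are all stand-ins for far more involved production machinery.

```python
import re

# Hypothetical toy lexicon; real sentiment analysis is far more involved.
POSITIVE = {"good", "great", "reliable"}
NEGATIVE = {"bad", "slow", "broken"}

def enrich(doc_id, text):
    """Run the expensive per-document steps once, returning a derived record."""
    # Tokenize: find sentence and word boundaries.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    # Sentiment: one of several derived attributes worth persisting.
    sentiment = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return {"doc_id": doc_id, "sentences": len(sentences),
            "tokens": tokens, "sentiment": sentiment}

# Store the derived record next to the raw text so queries never re-tokenize.
derived_store = {}
rec = enrich(42, "The new release is great. Upgrades were slow, though.")
derived_store[rec["doc_id"]] = rec
```

At query time you read `derived_store` (in practice, a column or index in the DBMS) rather than re-running the whole pipeline per document.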
If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can happen to time series or traditional relational tables as well. (See, for example, the use cases in two recent Aster Data slide decks.) Generally, those are not processes that you want to try to run in real time.
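The weblog case reduces to a classic map/reduce shape: map each log line to key/value pairs, then reduce by key into a derived table. A minimal single-process sketch, with an invented "timestamp url status" log format standing in for real weblog data:

```python
from itertools import groupby
from operator import itemgetter

# Toy weblog lines in a hypothetical "timestamp url status" format.
logs = [
    "2010-12-14T00:01 /home 200",
    "2010-12-14T00:02 /pricing 200",
    "2010-12-14T00:03 /home 500",
]

def map_phase(line):
    """Emit one (url, 1) pair per request."""
    _, url, _status = line.split()
    yield (url, 1)

def reduce_phase(pairs):
    """Sum the counts for each url; groupby requires sorted input."""
    return {url: sum(c for _, c in group)
            for url, group in groupby(sorted(pairs, key=itemgetter(0)),
                                      key=itemgetter(0))}

mapped = [kv for line in logs for kv in map_phase(line)]
hit_counts = reduce_phase(mapped)  # derived table, ready to load into the DBMS
```

The point is the output: `hit_counts` is derived data you store once, not something you recompute on every query over the raw logs.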
Scientists have a massive need to adjust or "cook" data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:
Log processing, not unlike that done in various commercial sectors.
Assigning data to different kinds or densities of coordinate grids -- "regridding" -- often through a process of interpolation/approximation/estimation.
Adjusting/normalizing data for all kinds of effects (such as weather cycles).
Examples where data adjustment is needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
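Regridding in particular comes down to interpolation. A crude one-dimensional sketch, with made-up sensor readings; real scientific regridding handles 2-D and 3-D grids and much more careful approximation schemes:

```python
def regrid(samples, new_xs):
    """Linearly interpolate irregular (x, y) samples onto new x positions."""
    xs, ys = zip(*sorted(samples))
    out = []
    for x in new_xs:
        # Clamp positions outside the sampled range to the edge values.
        if x <= xs[0]:
            out.append(ys[0])
        elif x >= xs[-1]:
            out.append(ys[-1])
        else:
            # Find the bracketing samples and interpolate between them.
            i = next(j for j in range(1, len(xs)) if xs[j] >= x)
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            out.append(ys[i - 1] + t * (ys[i] - ys[i - 1]))
    return out

# Irregularly spaced readings regridded onto an evenly spaced 0..3 grid.
readings = [(0.0, 10.0), (1.5, 13.0), (3.0, 10.0)]
even = regrid(readings, [0, 1, 2, 3])
```

Once the instrument data has been regridded like this, the derived grid is what gets stored and queried; nobody wants to re-interpolate petabytes on the fly.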
Similar issues arise in marketing applications, even beyond the kind of straightforward, predictive-analytics-based scoring and psychographic/clustering results one might expect.
For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when, in fact, you're a 32-year-old male with 400 Facebook friends who are mostly in your age range. Let's say you tend to look at Web sites about cars, poker, and video games and have a propensity to click on ads featuring scantily clad females.
Increasingly, analytic systems presented with this scenario would be smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
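The division of labor can be sketched as follows. The segment names, signals, and scoring rule here are all invented for illustration; the point is that the expensive inference runs in a batch job, and ad serving does only a key lookup against the stored result.

```python
def score_user(profile):
    """Hypothetical offline scoring: behavioral signals override the
    registration form when they contradict it."""
    if {"cars", "poker", "video games"} & set(profile["sites"]):
        return "male_25_34"
    return profile["claimed_segment"]

users = {
    "u1001": {"claimed_segment": "female_90_plus",
              "sites": ["cars", "poker", "video games"]},
}

# Batch job: precompute every score; a dict stands in for the DBMS table.
segment_table = {uid: score_user(p) for uid, p in users.items()}

# Ad-serving path: one lookup, no model inference at request time.
def pick_segment(uid):
    return segment_table.get(uid, "unknown")
```

Running `score_user` per ad request would blow the latency budget; reading a precomputed `segment_table` row does not.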
Curt Monash runs Monash Research, which provides strategic advice to users and vendors of advanced information technology. He also writes the blogs DBMS 2, Text Technologies, and Strategic Messaging. Write him at [email protected]