Add Derived Data To Your DBMS Strategy - InformationWeek


Software // Information Management
Commentary
12/14/2010
09:22 AM
Curt Monash

Add Derived Data To Your DBMS Strategy

Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.

Text analytics requires a lot of processing per document. You need to tokenize the text (among other things, identify the boundaries of the words, sentences and paragraphs); identify the words' meanings; map out the grammar; resolve references such as pronouns; and often do more besides (e.g., sentiment analysis).

All that takes a double-digit number of steps, many of them expensive. No way are you going to redo the whole process each time you run a query. (Not coincidentally, MarkLogic -- which does a huge fraction of its business in text-oriented uses -- thinks heavily in terms of the enhancement and augmentation of data.)
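As a rough illustration of the point, here is a minimal Python sketch that runs a text pipeline once per document and stores the results as derived data, so that queries read the stored rows instead of re-parsing the text. Everything in it is an assumption made for illustration: the `analyze` function is a trivial stand-in for a real, far more expensive pipeline, and the `doc_terms` table and function names are hypothetical.

```python
import re
import sqlite3


def analyze(text):
    """Stand-in for an expensive text-analytics pipeline (hypothetical).

    A real pipeline would also split sentences, parse grammar, resolve
    pronouns and so on; here we only lowercase and tokenize words.
    """
    return re.findall(r"[a-z']+", text.lower())


def build_index(conn, docs):
    """Run the pipeline once per document and persist the derived rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_terms "
        "(doc_id INTEGER, term TEXT, freq INTEGER)"
    )
    for doc_id, text in docs.items():
        counts = {}
        for token in analyze(text):
            counts[token] = counts.get(token, 0) + 1
        conn.executemany(
            "INSERT INTO doc_terms VALUES (?, ?, ?)",
            [(doc_id, term, freq) for term, freq in counts.items()],
        )
    conn.commit()


def docs_mentioning(conn, term):
    """Answer a query from the stored derived data; no re-analysis needed."""
    rows = conn.execute(
        "SELECT doc_id FROM doc_terms WHERE term = ? ORDER BY doc_id",
        (term.lower(),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
docs = {
    1: "The database stores raw data. It also stores derived data.",
    2: "Queries should be fast.",
}
build_index(conn, docs)
print(docs_mentioning(conn, "derived"))
```

The design choice worth noticing is that `analyze` is called only inside `build_index`; the query path never touches the original text.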

If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables. (See, for example, the use cases in two recent Aster Data slide decks.) Generally, those are not processes that you want to try to run in real time.
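The shape of such a batch job can be sketched in a few lines of plain Python. This is not Hadoop code, only an in-memory imitation of the map and reduce steps; the log format, the sample lines, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weblog lines: timestamp, path, HTTP status.
LOG_LINES = [
    "2010-12-14T09:22:01 /home 200",
    "2010-12-14T09:22:05 /pricing 200",
    "2010-12-14T09:23:10 /home 404",
    "2010-12-15T10:00:00 /home 200",
]


def map_line(line):
    """Map step: emit ((day, path), 1) for each raw log line."""
    timestamp, path, _status = line.split()
    day = timestamp.split("T")[0]
    yield (day, path), 1


def reduce_counts(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_batch(lines):
    """Crunch the raw log once; the result is a small table of derived data."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


daily_hits = run_batch(LOG_LINES)
print(daily_hits)
```

The output, hits per day and path, is what you would load into the database for further processing, rather than rescanning the raw logs for every query.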

Scientists have a massive need to adjust or "cook" data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:

Log processing, not unlike that done in various commercial sectors.

Assigning data to different kinds or densities of coordinate grids -- "regridding" -- often through a process of interpolation/approximation/estimation.

Adjusting/normalizing data for all kinds of effects (such as weather cycles).

Examples where data adjustment is needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
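Regridding is the easiest of these to show in miniature. The sketch below moves one-dimensional readings from a coarse grid onto a finer one by linear interpolation. It is an assumption-laden toy: real scientific regridding usually works in two or more dimensions and may use quite different estimation methods, and the function name and sample values here are hypothetical.

```python
from bisect import bisect_right


def regrid_linear(src_x, src_y, dst_x):
    """Linearly interpolate samples (src_x, src_y) onto the points in dst_x.

    Assumes src_x is strictly increasing. Destination points outside the
    source range are clamped to the nearest end value.
    """
    out = []
    for x in dst_x:
        if x <= src_x[0]:
            out.append(src_y[0])
        elif x >= src_x[-1]:
            out.append(src_y[-1])
        else:
            # Index of the source interval [src_x[i], src_x[i + 1]) holding x.
            i = bisect_right(src_x, x) - 1
            x0, x1 = src_x[i], src_x[i + 1]
            y0, y1 = src_y[i], src_y[i + 1]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


coarse_x = [0.0, 10.0, 20.0]
coarse_y = [1.0, 3.0, 2.0]
fine_x = [0.0, 5.0, 10.0, 15.0, 20.0]

fine_y = regrid_linear(coarse_x, coarse_y, fine_x)
print(fine_y)  # [1.0, 2.0, 3.0, 2.5, 2.0]
```

Cheap as this looks for five points, doing it across a large grid for every query adds up quickly, which is why the regridded values tend to be stored.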

Similar issues arise in marketing applications, even beyond the kind of straightforward, predictive-analytics-based scoring and psychographic/clustering results one might expect.

For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when, in fact, you're a 32-year-old male with 400 Facebook friends who are mostly in your age range. Let's say you tend to look at Web sites about cars, poker, and video games and have a propensity to click on ads featuring scantily clad females.

Increasingly, analytic systems presented with this scenario would be smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
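The split between the slow analysis and the fast lookup might look something like the following Python sketch. It is a toy under stated assumptions, not how any particular ad system works: the rule for overriding a declared age, the 20-year threshold, the data layout, and all function names are hypothetical.

```python
def infer_segment(declared, behavior):
    """Toy rule (hypothetical): trust behavior over the form when they conflict."""
    friend_ages = sorted(behavior["friend_ages"])
    median_friend_age = friend_ages[len(friend_ages) // 2]
    # Assumed threshold: a gap of more than 20 years marks the form as bogus.
    if abs(declared["age"] - median_friend_age) > 20:
        age = median_friend_age
    else:
        age = declared["age"]
    return {
        "age_estimate": age,
        "interests": sorted(set(behavior["site_topics"])),
    }


def batch_score(users):
    """Offline step: run the analysis for every user and keep the results."""
    return {
        user_id: infer_segment(user["declared"], user["behavior"])
        for user_id, user in users.items()
    }


def serve_ad(segment_store, user_id):
    """Online step: a single lookup against the stored derived data."""
    segment = segment_store.get(user_id)
    if segment is None or not segment["interests"]:
        return "generic"
    return segment["interests"][0]


users = {
    "u1": {
        "declared": {"age": 90, "gender": "F"},
        "behavior": {
            "friend_ages": [30, 31, 32, 33, 35],
            "site_topics": ["cars", "poker", "video games", "cars"],
        },
    },
}

segment_store = batch_score(users)
print(segment_store["u1"]["age_estimate"])  # 32
print(serve_ad(segment_store, "u1"))        # cars
```

In practice `segment_store` would be a table in the database rather than a dictionary, but the division of labor is the same: the analysis runs in advance, and ad serving only reads its results.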

Curt Monash runs Monash Research, which provides strategic advice to users and vendors of advanced information technology. He also writes the blogs DBMS 2, Text Technologies, and Strategic Messaging. Write him at [email protected]
