Databases Alone Can't Conquer Big Data Problems - InformationWeek

Commentary
4/26/2011
09:18 AM
Doug Henschen

Databases Alone Can't Conquer Big Data Problems

Data-integration and data-transformation steps taken before loading the database are sometimes the best antidote to high-volume storage and scaling challenges.

These days, when faced with big data, companies are used to throwing low-cost storage and massively parallel processing (MPP) horsepower at the problem. But comScore got its start way back in 1999, when disk capacities were measured in the tens of gigabytes (not terabytes) and before the likes of Netezza and Greenplum (founded in 2003 and 2004, and acquired last year by IBM and EMC, respectively) were even around.

ComScore has used DMExpress since 2000, when processing power and storage were still quite costly. The product also supports high-volume extract, transform and load (ETL) work, but these days it's marketed as "data integration acceleration software," designed to be used in conjunction with more popular integration suites such as Informatica PowerCenter and IBM InfoSphere. ComScore only uses DMExpress for sorting, filtering and aggregation, and it uses database-native capabilities, rather than yet another ETL package, for data loading.

In another example of doing the big-data heavy lifting before data enters the database, comScore uses DMExpress to aggregate the thousands of new records collected from each of its two million panelists each week. A first step is to sort the sites visited by URL, so a processing-intensive comScore taxonomy used to categorize Web sites only has to be called when the URL changes.

Instead of classifying the 40 sites a panelist might have visited one by one in the order they were visited, the visits are grouped into, say, the three distinct sites visited overall (with 20 visits to Facebook, 12 to Gmail, and eight to The New York Times listed all in a row). "That saves a lot of CPU time and a lot of effort," Brown says.
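The sort-then-classify idea can be sketched in a few lines of Python. This is only an illustration of the technique, not comScore's or DMExpress's actual implementation: the `classify_visits` function, the `slow_taxonomy` stand-in, and the sample URLs are all hypothetical.

```python
from itertools import groupby
import random


def classify_visits(urls, classify):
    """Sort visits by URL, then call the costly classifier once per distinct URL."""
    results = []
    for url, run in groupby(sorted(urls)):
        category = classify(url)          # invoked only when the URL changes
        visit_count = sum(1 for _ in run)  # length of the run of identical URLs
        results.append((url, category, visit_count))
    return results


calls = []  # records every classifier invocation so the savings are visible


def slow_taxonomy(url):
    """Hypothetical stand-in for a processing-intensive site taxonomy."""
    calls.append(url)
    categories = {
        "facebook.com": "social",
        "mail.google.com": "email",
        "nytimes.com": "news",
    }
    return categories.get(url, "other")


# 40 visits across three sites, shuffled to mimic the order they were visited.
visits = ["facebook.com"] * 20 + ["mail.google.com"] * 12 + ["nytimes.com"] * 8
random.Random(0).shuffle(visits)

summary = classify_visits(visits, slow_taxonomy)
for url, category, visit_count in summary:
    print(url, category, visit_count)
print("classifier calls:", len(calls))
```

In this sketch the classifier runs three times rather than 40, at the cost of one sort over the visit list.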

Put into production in 2009, this sorting step let comScore process daily panel updates in seven hours where it used to take 24. And monthly updates are now delivered on the fifth of the month instead of the 15th. "That's a big win for the business because our customers can get a much quicker understanding of how their campaigns are performing," Brown says.

As I recently reported in "IT And Marketing in the Digital Age", this is exactly the kind of low-latency information marketers now demand from suppliers and the kind of efficiency they're seeking internally.

Not every company operates at comScore's scale, but the lesson is that not every big-data challenge is best left to the high-powered database platform to solve. Sorting, filtering, aggregation, and transformation steps can streamline data before it gets to the data warehouse, saving CPU cycles and storage space before and after the crucial data-loading stage.
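As a rough sketch of filtering and aggregating ahead of the load step, the Python below collapses raw events into one row per panelist and site, so fewer rows reach the warehouse. The record layout, the `preaggregate` name, and the rule of dropping empty URLs are assumptions made for illustration only.

```python
from collections import Counter


def preaggregate(records):
    """Drop records with no URL, then count visits per (panelist_id, url) pair."""
    counts = Counter((pid, url) for pid, url in records if url)
    # One output row per pair, sorted so the load file is deterministic.
    return sorted((pid, url, n) for (pid, url), n in counts.items())


# Hypothetical raw events: (panelist_id, url)
raw = [
    (1, "facebook.com"),
    (1, "facebook.com"),
    (1, ""),               # unusable record, filtered out before loading
    (2, "nytimes.com"),
    (1, "nytimes.com"),
]

rows = preaggregate(raw)
for row in rows:
    print(row)
```

Here five raw events become three rows to load; the same pattern applies to any key the warehouse reports on.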

Nine times out of 10 when I hear about big data, it's all about the database and the analytics. But I'm increasingly hearing about all the work that takes place even before big data gets moved into warehouses. If you have lessons learned or smart shortcuts you can share, send me an email note. I've placed a big-data-integration feature story in the queue for this fall, and I'm looking for good customer examples.
