Databases Alone Can't Conquer Big Data Problems - InformationWeek

09:18 AM
Doug Henschen

Databases Alone Can't Conquer Big Data Problems

Data-integration and data-transformation steps taken before loading the database are sometimes the best antidote to high-volume storage and scaling challenges.

When it comes to making sense of big data, the glory is hogged by database platforms such as EMC Greenplum, IBM Netezza, Oracle Exadata, and Teradata.

But sometimes simple data processing work done outside of the database can help you scale and eliminate hours or even days of processing on expensive database platforms.

Marketing data provider comScore, for example, uses data-sorting techniques to improve compression and aggregate data before it even gets to the warehouse. As a result it's saving on storage, reducing processing times, and, most importantly, speeding information to its data-hungry customers.

As I detailed late last year, comScore has been a leading source of online marketing data for more than a decade. As such it was a pioneer of big-data computing. At 150 terabytes, the company's latest Sybase IQ warehousing platform sounds big, but it would have to be many times larger if not for the company's skill at compressing and aggregating data.

The big-data leagues span from the tens of terabytes into the petabytes. That's when it becomes essential to add the power of massively parallel processing (MPP), used by most of the leading platforms, or the compression advantages of column-oriented databases (such as Sybase IQ and HP Vertica). But organizations playing at this scale also have to manage big data before it gets into the database.

To give you some idea, comScore tracks the daily Internet surfing (and mobile-access) habits of about 2 million consumer panelists who have registered and supplied their demographic and psychographic profiles. The company also takes a daily census of activity across the Internet so it can report on and compare Internet-wide behavior to that of targeted segments tracked through the panel data. As a result, comScore collects about 2 billion new rows of panel data and more than 18 billion new rows of census data each day.

That means more than 20 billion rows of new data are loaded into the data warehouse each day. Of course, almost every organization applies compression to reduce storage demands. But comScore also uses Syncsort DMExpress data integration software to sort the data into alphanumeric order before it's loaded into the warehouse. Sorting places identical and similar values next to one another, which gives the compression step longer repeated runs to work with and improves compression ratios.

Where 10 bytes of unsorted data can be compressed to three or four bytes, says Michael Brown, comScore's chief technology officer, 10 bytes of sorted data can typically be crunched down to one byte. "That makes a huge difference in the volume of data we have to store, and it streamlines our processes and reduces our capital costs," Brown says.
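
To make the sort-then-compress effect concrete, here is a minimal Python sketch. It is not comScore's pipeline and does not use Syncsort DMExpress or Sybase IQ. It assumes a made-up column of site names (the `site0000.example` values, the row count, and the helper names `make_rows` and `compressed_size` are all hypothetical) and uses general-purpose zlib compression as a stand-in for whatever the warehouse applies. The exact ratios will differ from the figures Brown cites, but the direction of the effect should be the same.

```python
import random
import zlib


def make_rows(n_rows=50_000, n_sites=200, seed=42):
    """Build a hypothetical column of site names in random (unsorted) order."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sites = [f"site{i:04d}.example" for i in range(n_sites)]
    return [rng.choice(sites) for _ in range(n_rows)]


def compressed_size(rows):
    """Return the zlib-compressed size, in bytes, of the newline-joined rows."""
    payload = "\n".join(rows).encode("utf-8")
    return len(zlib.compress(payload, 6))


rows = make_rows()
raw_size = len("\n".join(rows).encode("utf-8"))

# Same rows, same compressor; the only difference is the row order.
unsorted_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))

print(f"raw bytes:             {raw_size}")
print(f"compressed (unsorted): {unsorted_size}")
print(f"compressed (sorted):   {sorted_size}")
```

Sorting groups identical values into long runs, so the compressor can encode each run very cheaply instead of re-describing scattered values. The trade-off is the cost of the sort itself, which is why it makes sense to do it in a data-integration step before the load rather than on the warehouse platform.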
