News | 10/17/2011 | 10:51 AM
3 Big Data Challenges: Expert Advice

The "big" part of big data doesn't tell the whole story. Let's talk volume, variety, and velocity of data--and how you can help your business make sense of all three.

Any time data volumes get really big, compression is key: it saves on storage, which remains a significant component of data management expense even as per-terabyte hardware costs keep falling.

Consider Polk, a household name in the auto industry, which sells online subscriptions to data about vehicle sales and ownership to automakers, parts suppliers, dealers, advertising firms, and insurance companies. Polk surpassed 46 TB of storage last year before it launched a long-term upgrade to eventually move from conventional Oracle RAC (clustered) database deployments to an Oracle Exadata appliance. As of this summer, the project was halfway done, and databases that formerly held about 22 TB had been compressed down to about 13 TB, using Oracle's Hybrid Columnar Compression.

Polk's Exadata migration is still in progress, but to date it has consolidated nine databases down to four and eliminated eight of 22 production database servers. The cost of a new Exadata deployment is about $22,000 to $26,000 per terabyte, before discounts, according to independent analyst Curt Monash. If Polk's storage efficiencies hold up through the rest of the project, 46 TB will be trimmed to about 28 TB. Using Monash's estimate, the 18-TB difference could trim the deployment's cost by as much as $400,000.
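For a rough sense of the math, here's a back-of-the-envelope sketch using only the figures above (Monash's per-terabyte range is a pre-discount estimate):

```python
# Back-of-the-envelope check on the projected savings, using the article's figures.
storage_before_tb = 46                  # Polk's storage before the migration
storage_after_tb = 28                   # projected storage if compression ratios hold
cost_per_tb_low, cost_per_tb_high = 22_000, 26_000   # Monash's pre-discount estimate

saved_tb = storage_before_tb - storage_after_tb      # 18 TB
print(f"Estimated savings: ${saved_tb * cost_per_tb_low:,} to ${saved_tb * cost_per_tb_high:,}")
# Prints roughly $396,000 to $468,000 -- in the neighborhood of the $400,000 figure above.
```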

As companies begin managing big data, crafty IT pros are finding that some old tricks have renewed value when applied to such large volumes. Steps to improve compression and query performance that may not have seemed worth the effort can become more valuable. This is where technical capabilities end, and experience and ingenuity begin.

Wise use of best practices like data sorting lets companies improve the performance, and prolong use, of their current database platforms. And, for those that do move up to one of the latest and greatest big-data platforms, data management discipline instilled from the start will optimize performance and prolong the life of that investment. Sorting, for example, is a relatively easy way to optimize compression. Just as consistency of columnar data aids compression, sorting brings order to data before it's loaded into a database; that makes it easier for compression engines to do their work.

ComScore, the digital-media measurement company, has been using tricks like sorting since its first deployment back in 2000. Sybase IQ has been the company's primary database platform from the beginning, and ComScore makes the most of the product's selective querying and compression capabilities. But with more than 56 TB in its store, ComScore also applies techniques such as sorting to help the database platform do a better job.

ComScore uses Syncsort's DMExpress data-integration software to sort data alphanumerically before loading it into Sybase IQ. While 10 bytes of the raw clickstream data that ComScore typically examines can be compressed to 3 or 4 bytes by Sybase IQ, 10 bytes of sorted clickstream data can often be crunched down to 1 byte, according to ComScore CTO Michael Brown.
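The effect is easy to reproduce outside Sybase IQ. Here is a minimal sketch using Python's zlib as a stand-in compressor on synthetic clickstream-like records; the exact ratios will vary with the data and the engine:

```python
import random
import zlib

# Synthetic clickstream-like records: the same handful of sites visited in random order.
sites = ["espn.com", "ford.com", "news.google.com", "facebook.com"]
records = [f"{random.choice(sites)},{random.randint(1, 5)}\n" for _ in range(100_000)]

unsorted_blob = "".join(records).encode()
sorted_blob = "".join(sorted(records)).encode()   # sort before "loading"

print("unsorted:", len(zlib.compress(unsorted_blob)))
print("sorted:  ", len(zlib.compress(sorted_blob)))
# The sorted blob compresses noticeably smaller: identical records now sit
# next to each other, which is exactly what compression engines exploit.
```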

Sorting can also streamline processing, improving speed on top of lowering storage costs. For example, ComScore sorts URL data to minimize how often the system has to look up the taxonomy that describes, say, ESPN.com as a sports site, Ford.com as an auto site, Google News as a news site, and Facebook as a social network. Think of someone who spends a Sunday afternoon bouncing across those sites, checking scores, reading news, browsing for a car, and posting on Facebook.

Instead of loading the URLs visited during that Web session in the order they were visited, possibly triggering a dozen or more site lookups, sorted data would lump all visits to the same sites together, triggering just four lookups. "That saves a lot of CPU time and a lot of effort," Brown says.
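In other words, sorting turns one lookup per visit into one lookup per distinct site. A simplified Python sketch of the idea, with an invented taxonomy table and session for illustration:

```python
from itertools import groupby

# Hypothetical taxonomy lookup -- in practice this is a far larger table or service.
TAXONOMY = {"espn.com": "sports", "ford.com": "auto",
            "news.google.com": "news", "facebook.com": "social"}

# One Sunday-afternoon session, in the order the sites were visited.
session = ["espn.com", "news.google.com", "facebook.com", "espn.com",
           "ford.com", "facebook.com", "espn.com", "news.google.com"]

# Unsorted: one taxonomy lookup per visit (8 lookups here).
categories_unsorted = [TAXONOMY[url] for url in session]

# Sorted: identical URLs are grouped, so each distinct site is looked up once (4 lookups).
lookups = 0
categories_sorted = []
for url, visits in groupby(sorted(session)):
    lookups += 1
    category = TAXONOMY[url]                  # one lookup per distinct site
    categories_sorted.extend(category for _ in visits)

print(len(session), "visits,", lookups, "lookups after sorting")
```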

Polk also relies on sorting to cut processing time in a slightly different way. The Oracle database has built-in indexing capabilities that can help improve query performance, but the feature may not work if it can't spot obvious groupings of data. Sorting helps Polk force indexing to happen the way it's most useful. "If you can lump the data that goes together, the index knows exactly where to find the data you're after," says Doug Miller, Polk's director of database development and operations.

Polk subscribers often do queries by region, so the company sorts auto sales data by ZIP code. If a car manufacturer wants to know which models were the best sellers in Seattle last month, the database knows just where to find that data and won't waste time querying data tied to nonrelevant ZIP codes.
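A rough analogy in Python: a sorted list plus binary search stands in for the database's index, jumping straight to the contiguous block of relevant ZIP codes instead of scanning everything (the column layout and data here are hypothetical; Oracle's indexing is the actual mechanism at Polk):

```python
from bisect import bisect_left, bisect_right

# Hypothetical sales records, sorted by ZIP code before loading: (zip_code, model).
sales = sorted([
    ("60614", "Model A"), ("98101", "Model B"), ("98101", "Model C"),
    ("98109", "Model B"), ("30301", "Model A"), ("98109", "Model A"),
])
zips = [zip_code for zip_code, _ in sales]

def models_in_zip_range(lo, hi):
    """Return models sold in ZIP codes lo..hi without touching unrelated rows."""
    start = bisect_left(zips, lo)
    end = bisect_right(zips, hi)
    return [model for _, model in sales[start:end]]

# Seattle-area ZIP codes cluster together in the sorted data, so the "index"
# lands directly on that contiguous block.
print(models_in_zip_range("98101", "98199"))
```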

Polk is also making extensive use of "materialized views," which effectively store often-requested query results for rapid recall. Exadata's compression has helped reduce the size of materialized views, which lets Polk do more sophisticated analyses, since it can hold more views in cache and thus speed up performance when exploring multiple dimensions of data.

"If a customer wanted to look across multiple dealer zones and then start bringing in customer demographics and comparing lease transactions versus purchases, that would have taken as long as two to five minutes in the old environment," Miller says. "In Exadata, these sorts of queries are running in 10 seconds."

The critical point is that these tricks for managing data volume are about more than cutting storage costs. Getting faster, more relevant insight is really the name of the game with big data.

Comments
Doug Laney, 11/29/2011, 3:56:41 AM
re: 3 Big Data Challenges: Expert Advice
Great to see the 3Vs framework for big data catching on. Anyone interested in the original 2001 Gartner (then Meta Group) research paper I published positing the 3Vs, entitled "Three Dimensional Data Challenges," feel free to reach me. -Doug Laney, VP Research, Gartner
HM, 10/19/2011, 12:38:57 PM
re: 3 Big Data Challenges: Expert Advice
I noticed that you haven't mentioned the HPCC offering from LexisNexis Risk Solutions. Unlike Hadoop distributions, which have only been available since 2009, HPCC is a mature platform and provides a data delivery engine together with a data transformation and linking system equivalent to Hadoop. The main advantages over other alternatives are the real-time delivery of data queries and the extremely powerful ECL programming model.
Check them out at: www.hpccsystems.com
PulpTechie, 10/18/2011, 3:13:05 PM
re: 3 Big Data Challenges: Expert Advice
Interesting read. Thanks for sharing.