To hoard or not to hoard? EMC's Bill Schmarzo defends the practice of saving all the data you can, even when its value is uncertain.
Which data should you keep? Which should you toss? And how can you determine which data will deliver actionable insights at some point down the road, and which is merely taking up valuable storage space?
There's no easy answer, of course, and the solution will almost certainly vary by organization or industry. But two camps are emerging in the what-to-keep debate. One professes that big data's real value comes from near-real-time analysis of information and that archived data doesn't deliver a lot of bang for the buck. Another, however, argues that it's good business sense to store all the data you can.
Bill Schmarzo, chief technology officer of EMC Global Services, is a proud member of the data hoarder camp. "I'm a hoarder; I want it all," Schmarzo told InformationWeek in a phone interview. "And even if I don't know yet how I'll use that data, I want it because I can store it so cheaply. My data science team might find a use for it."
Perhaps it's not surprising that an executive of EMC, a major player in the data storage industry, would be a strong proponent of the save-it-all philosophy. But Schmarzo does make a compelling argument backed by a key technological trend: The cost of data storage continues to plummet dramatically.
One reason is that the Hadoop Distributed File System (HDFS) stores data at a much lower cost than a traditional relational database management system. Schmarzo passed along an anecdote about a friend of his, an executive in charge of analytics for a national insurance company: "He found that it cost the same to store four terabytes on his enterprise data warehouse as it did to put 200 terabytes on Hadoop -- that's a 50x improvement."
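The 50x figure follows directly from the ratio of the two capacities at equal cost. A quick back-of-the-envelope check (the dollar amount below is a placeholder, since the anecdote gives only the ratio):

```python
# Back-of-the-envelope check of the anecdote's 50x figure.
# The total spend is hypothetical; only the capacity ratio matters.
total_spend = 100_000  # dollars (illustrative, not from the article)

edw_tb = 4       # terabytes stored on the enterprise data warehouse
hadoop_tb = 200  # terabytes stored on Hadoop for the same spend

cost_per_tb_edw = total_spend / edw_tb        # dollars per TB on the warehouse
cost_per_tb_hadoop = total_spend / hadoop_tb  # dollars per TB on Hadoop

improvement = cost_per_tb_edw / cost_per_tb_hadoop
print(improvement)  # 50.0
```

Whatever the absolute prices, the per-terabyte cost advantage is the capacity ratio: 200 / 4 = 50.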
Greatly reduced storage costs allow you to think differently about how you approach and monetize data, Schmarzo added. "We need to have a data-abundance mentality. We want it all. We want to share [data], grab it, play with it, and figure out what's there. And if it's not useful, shove it back into its bin and go on to the next data source."
Schmarzo provided an example of how a large grocery chain mined 15 years' worth of data on its customers' buying habits. Thirteen months of this "loyalty card" data was stored in the company's data warehouse; the rest was archived on tape drives. "Their key business initiative is around personalized marketing offers," said Schmarzo. "They wanted to leverage their [mobile] app to deliver personalized offers based on all this customer data they have."
An analysis of the data revealed an interesting, and actionable, insight: The 15-year time period included two recessions. This allowed the grocery chain to determine when shoppers were first impacted by the economic slowdowns, and when they started to recover from them.
"We identified three things that people do when they start struggling," Schmarzo said. "First off, they stop buying higher-quality products and start buying lower-quality products. Two, they start buying private-label products. Tissue paper is the first one to go, for some odd reason. Three, they start using coupons more often."
In retrospect, this behavior makes sense. "That's exactly what the data told us," said Schmarzo. "You can extrapolate based on demographics, geography, and behavioral groups -- all different ways to slice and dice the data, once you have it at that very low level of granularity."
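As a rough illustration of the kind of slicing Schmarzo describes, the sketch below tracks two of his three "struggling shopper" signals (private-label purchases and coupon use) month by month across hypothetical loyalty-card transactions. All field names and data here are invented for illustration; no real retailer schema is implied.

```python
from collections import defaultdict

# Hypothetical loyalty-card transactions; field names are illustrative only.
transactions = [
    {"month": "2008-01", "private_label": False, "used_coupon": False},
    {"month": "2008-01", "private_label": True,  "used_coupon": False},
    {"month": "2008-06", "private_label": True,  "used_coupon": True},
    {"month": "2008-06", "private_label": True,  "used_coupon": True},
    {"month": "2008-06", "private_label": False, "used_coupon": False},
]

def signal_share(rows, field):
    """Fraction of transactions per month where `field` is True --
    a simple way to watch a 'struggling shopper' signal rise over time."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["month"]] += 1
        hits[row["month"]] += row[field]
    return {month: hits[month] / totals[month] for month in totals}

print(signal_share(transactions, "private_label"))
print(signal_share(transactions, "used_coupon"))
```

The same aggregation could be keyed by demographic or geographic fields instead of month; with transaction-level granularity, any such dimension becomes a way to slice the data.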
He added, "We would not have had [those insights] if we had not had access to 15 years of data. So I'm a big believer in 'I want it all.' "
Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.