When Data Hoarding Makes Sense

To hoard or not to hoard? EMC's Bill Schmarzo defends the practice of saving all the data you can, even when its value is uncertain.

Jeff Bertolucci, Contributor

July 22, 2014



Which data should you keep? Which should you toss? And how can you determine which data will deliver actionable insights at some point down the road, and which is merely taking up valuable storage space?

There's no easy answer, of course, and the solution will almost certainly vary by organization or industry. But two camps are emerging in the what-to-keep debate. One professes that big data's real value comes from near-real-time analysis of information and that archived data doesn't deliver a lot of bang for the buck. Another, however, argues that it's good business sense to store all the data you can.

Bill Schmarzo, chief technical officer of EMC Global Services, is a proud member of the data hoarder camp. "I'm a hoarder; I want it all," Schmarzo told InformationWeek in a phone interview. "And even if I don't know yet how I'll use that data, I want it because I can store it so cheaply. My data science team might find a use for it."


Perhaps it's not surprising that an executive of EMC, a major player in the data storage industry, would be a strong proponent of the save-it-all philosophy. But Schmarzo does make a compelling argument backed by a key technological trend: The cost of data storage continues to plummet dramatically.

One reason is that the Hadoop Distributed File System (HDFS) stores data at a much lower cost than traditional relational database management systems (RDBMSs). Schmarzo passed along an anecdote about a friend of his, an executive in charge of analytics for a national insurance company: "He found that it cost the same to store four terabytes on his enterprise data warehouse as it did to put 200 terabytes on Hadoop -- that's a 50x improvement."
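The arithmetic behind that "50x" figure can be checked in a few lines. This is only an illustrative sketch: the function names are hypothetical, and the only inputs are the two capacities quoted in the anecdote (4 TB on the data warehouse, 200 TB on Hadoop, for the same spend). No actual prices were given, so none are assumed here.

```python
def capacity_ratio(cheap_tb: float, expensive_tb: float) -> float:
    """Return how many times more data the cheaper platform holds for the same spend."""
    if expensive_tb <= 0:
        raise ValueError("expensive_tb must be positive")
    return cheap_tb / expensive_tb


def relative_cost_per_tb(cheap_tb: float, expensive_tb: float) -> float:
    """Return the cheaper platform's cost per TB as a fraction of the expensive platform's."""
    if cheap_tb <= 0:
        raise ValueError("cheap_tb must be positive")
    return expensive_tb / cheap_tb


# Figures from the anecdote: same budget buys 4 TB (warehouse) or 200 TB (Hadoop).
HADOOP_TB = 200
WAREHOUSE_TB = 4

ratio = capacity_ratio(HADOOP_TB, WAREHOUSE_TB)
fraction = relative_cost_per_tb(HADOOP_TB, WAREHOUSE_TB)

print(f"Capacity for the same spend: {ratio:.0f}x")
print(f"Hadoop cost per TB relative to the warehouse: {fraction:.0%}")
```

Put another way, under the anecdote's numbers each terabyte on Hadoop costs about one-fiftieth of a terabyte on the enterprise data warehouse.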

Greatly reduced storage costs allow you to think differently about how you approach and monetize data, Schmarzo added. "We need to have a data-abundance mentality. We want it all. We want to share [data], grab it, play with it, and figure out what's there. And if it's not useful, shove it back into its bin and go onto the next data source."

Schmarzo provided an example of how a large grocery chain mined 15 years' worth of data on its customers' buying habits. Thirteen months of this "loyalty card" data was stored in the company's data warehouse; the rest was archived on tape. "Their key business initiative is around personalized marketing offers," said Schmarzo. "They wanted to leverage their [mobile] app to deliver personalized [offers] based on all this customer data they have."

An analysis of the data revealed an interesting, and actionable, insight: The 15-year time period included two recessions. This allowed the grocery chain to determine when shoppers were first impacted by the economic slowdowns, and when they started to recover from them.

"We identified three things that people do when they start struggling," Schmarzo said. "First off, they stop buying higher-quality products and start buying lower-quality products. Two, they start buying private-label products. Tissue paper is the first one to go, for some odd reason. Three, they start using coupons more often."

In retrospect, this behavior makes sense. "That's exactly what the data told us," said Schmarzo. "You can extrapolate based on demographics, geography, and behavioral groups -- all different ways to slice and dice the data, once you have it at that very low level of granularity."

He added, "We would not have had [those insights] if we had not had access to 15 years of data. So I'm a big believer in 'I want it all.' "


About the Author(s)

Jeff Bertolucci


Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.
