7/7/2014
09:35 AM

Big Data: No Hoarding Allowed

The best insights come from data you've just collected, not the musty bits you've saved for years, argues SumAll's CEO.

The save-everything mantra chanted by many big data proponents is a waste of money and resources, as organizations will gain little, if any, actionable insights from massive stockpiles of archived data. Rather, the real big data payback comes from near-real-time analysis of information as it's collected.

So says Dane Atkinson, CEO of SumAll, a three-year-old data analytics startup based in New York City. SumAll's platform takes in data from a variety of sources, including social media, email, and e-commerce, and allows companies to analyze the information right away.

Given the real-time nature of SumAll's business, perhaps it's no surprise that its CEO would preach the benefits of fast-acting data analysis. Then again, Atkinson isn't the only big data player to point out the shortcomings of information hoarding.

In a phone interview with InformationWeek, Atkinson noted that companies often warehouse big data at great expense, even when they're not sure what insights they'll gain from it. And if they don't know which questions to ask of it today, they're hopeful the astute queries will come months, or even years, down the road.

"That's the theory. That's exactly it: 'We don't know smart questions to ask now, so we're going to keep it all so that we can ask them later,'" said Atkinson, distilling the common rationale behind data hoarding, which he considers an expensive process with a dubious ROI.

"It costs a lot of money," he said. "It costs us millions of dollars a year to store our customers' data."

But despite the expense, the popular trend is to save it all.

"It's not even a question. Every company, every Internet company, tries to store all the data they possibly can," he claimed. "They believe in this theory of big data, that it'll someday be valuable."

(Source: W.Rebel)

Atkinson wasn't suggesting that companies stop storing data altogether, but rather that they do so more efficiently and with a clearly defined strategy.

"We would highly discourage storing it in a fashion that's sort of the definition of big data -- where you have it in some SSD environment on Amazon, or on a rack of servers that are costing you a fortune -- because you're not getting value out of it," he said. "You're not asking questions because it's just too big."

Still, companies often become data hoarders.

"They're living in the hoarder's environment," said Atkinson. "They're taking in all the data and putting it into a repository."

One alternative: Rather than saving every bit, companies should determine the questions they want to ask of their data, and then store the indexes they really need, a move that "will take your data down by many factors," he claimed.

Take a retail business, for instance.

"You may not need to have every second's worth of transactional history over the last four years, but it's probably pretty handy to know how [each] day went," said Atkinson. "So rolling up those 60 minutes into an hour metric [will] give your team really good guidance on the trends and patterns they want to see."

Rather than storing, say, the 2 billion transactions your business did in the past two years, save an index that tallies the hourly transaction totals during that period, he added.

This approach can greatly reduce the size of your data hoard -- "gigabytes versus terabytes," claimed Atkinson.
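
As a rough illustration of the hourly roll-up Atkinson describes, here is a minimal Python sketch. It is not SumAll's method; the record layout (a timestamp paired with an amount in cents) and the names `hourly_rollup`, `count`, and `total_cents` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def hourly_rollup(transactions):
    """Collapse (timestamp, amount_cents) records into per-hour summaries.

    The input shape is hypothetical: each record is a datetime plus an
    integer amount in cents. Only one summary row per hour is kept.
    """
    buckets = defaultdict(lambda: {"count": 0, "total_cents": 0})
    for timestamp, amount_cents in transactions:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour]["count"] += 1
        buckets[hour]["total_cents"] += amount_cents
    return dict(buckets)


# Three made-up transactions: two in the 9 a.m. hour, one in the 10 a.m. hour.
sample = [
    (datetime(2014, 7, 7, 9, 5, 12), 1999),
    (datetime(2014, 7, 7, 9, 48, 3), 500),
    (datetime(2014, 7, 7, 10, 1, 30), 4250),
]

rollup = hourly_rollup(sample)
for hour in sorted(rollup):
    print(hour.isoformat(), rollup[hour])
```

However many transactions go in, the result has at most one row per hour: two years of history come to roughly 17,520 rows (2 x 365 x 24). The trade-off is that any question needing finer detail than an hour can no longer be answered from the roll-up alone.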

Again, however, he finds few businesses are slimming their data stockpiles.

"It's only the really smart companies that have started to pare that down," said Atkinson. "They may have the hoarder's closet somewhere, but they've also made a new [data] store that's much more efficient, that tries to answer smart questions and not just grab hold of everything."

Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.

Comments
Lorna Garey,
User Rank: Author
7/7/2014 | 1:47:24 PM
Theory vs. reality
It's all great in theory. However, to save selectively requires effort and will -- data classification programs, someone to decide to delete X set and take the fall if it's needed someday, etc. Meanwhile, storage is cheap and getting cheaper.
Thomas Claburn,
User Rank: Author
7/7/2014 | 1:55:27 PM
Re: Theory vs. reality
If only someone could convince the NSA of the merits of not hoarding data.
Doug Henschen,
User Rank: Apprentice
7/7/2014 | 3:35:23 PM
Re: Theory vs. reality
Another opinion on big data from a self-interested vendor. Atkinson's "cost millions to data warehouse" perspective is a little dated. And the example he offers, tied to structured transactional data, is also not a very "big data" frame of reference.

The point of aggregating to the hour instead of the second is simple enough -- conventional wisdom, really. But this seems like a very conventional frame of reference focused on developing analytics based on recency, frequency, and monetary value. What about variable data types like clickstreams, log files, or social data? That's when data gets really big. It's not just a matter of collecting more of the same old data. 
zaious,
User Rank: Strategist
7/7/2014 | 11:19:21 PM
Re: Theory vs. reality
Playing it safe (storing everything) is okay up to a point; it keeps one risk-free. However, that option will go away, since the amount of data generated keeps increasing and will require more and more storage. Humans are, by nature, hoarders. At the same time, it is tough to find the courage to hit 'Delete' and confirm 'Yes'.
StuartCarey,
User Rank: Apprentice
7/8/2014 | 5:42:52 AM
Not Applicable to everything
Although a very interesting read, this is not always applicable to everything, such as sensitive data sets (e.g., healthcare and the NHS, where I work in the UK).

I tend to keep everything to allow us to do year-by-year comparisons on the growth of a subject and more. For example, two years ago there was no "SOP" in place, and now there is; this has been the change in data.

 