Big Data: No Hoarding Allowed

The best insights come from data you've just collected, not the musty bits you've saved for years, argues SumAll's CEO.

Jeff Bertolucci, Contributor

July 7, 2014

(Image source: W.Rebel, http://commons.wikimedia.org/wiki/File:BinaryData50.png)

The save-everything mantra chanted by many big data proponents is a waste of money and resources, as organizations will gain few, if any, actionable insights from massive stockpiles of archived data. Rather, the real big data payback comes from near-real-time analysis of information as it's collected.

So says Dane Atkinson, CEO of SumAll, a three-year-old data analytics startup based in New York City. SumAll's platform takes in data from a variety of sources, including social media, email, and e-commerce, and allows companies to analyze the information right away.

Given the real-time nature of SumAll's business, perhaps it's no surprise that its CEO would preach the benefits of fast-acting data analysis. Then again, Atkinson isn't the only big data player to point out the shortcomings of information hoarding.

In a phone interview with InformationWeek, Atkinson noted that companies often warehouse big data at great expense, even when they're not sure what insights they'll gain from it. If they don't know which questions to ask of the data today, they hope the astute queries will come months, or even years, down the road.

"That's the theory. That's exactly it: 'We don't know smart questions to ask now, so we're going to keep it all so that we can ask them later,'" said Atkinson, distilling the common rationale behind data hoarding, which he considers an expensive process with a dubious ROI.

"It costs a lot of money," he said. "It costs us millions of dollars a year to store our customers' data."

But despite the expense, the popular trend is to save it all.

"It's not even a question. Every company, every Internet company, tries to store all the data they possibly can," he claimed. "They believe in this theory of big data, that it'll someday be valuable."

Atkinson wasn't suggesting that companies stop storing data altogether, but rather that they do so more efficiently and with a clearly defined strategy.

"We would highly discourage storing it in a fashion that's sort of the definition of big data -- where you have it in some SSD environment on Amazon, or on a rack of servers that are costing you a fortune -- because you're not getting value out of it," he said. "You're not asking questions because it's just too big."

Still, companies often become data hoarders.

"They're living in the hoarder's environment," said Atkinson. "They're taking in all the data and putting it into a repository."

One alternative: Rather than saving every bit, companies should determine the questions they want to ask of their data, and then store the indexes they really need, a move that "will take your data down by many factors," he claimed.

Take a retail business, for instance.

"You may not need to have every second's worth of transactional history over the last four years, but it's probably pretty handy to know how [each] day went," said Atkinson. "So rolling up those 60 minutes into an hour metric [will] give your team really good guidance on the trends and patterns they want to see."

Rather than storing, say, each of the 2 billion transactions your business processed in the past two years, save an index that tallies the hourly transaction totals during that period, he added.

This approach can greatly reduce the size of your data hoard -- "gigabytes versus terabytes," claimed Atkinson.
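To make the roll-up idea concrete, here is a minimal sketch in Python of collapsing per-transaction records into hourly totals. It is an illustration only, not SumAll's implementation: the `hourly_rollup` function, the `(timestamp, amount)` record layout, and the sample figures are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal


def hourly_rollup(transactions):
    """Collapse (timestamp, amount) records into one summary row per hour.

    The record layout is an assumption made for this sketch.
    """
    buckets = defaultdict(lambda: {"count": 0, "total": Decimal("0")})
    for ts, amount in transactions:
        # Truncate the timestamp to the start of its hour to get the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour]["count"] += 1
        buckets[hour]["total"] += amount
    # Return the buckets in chronological order.
    return dict(sorted(buckets.items()))


# Hypothetical sample data: three transactions spread across two hours.
raw = [
    (datetime(2014, 7, 7, 9, 5, 12), Decimal("19.99")),
    (datetime(2014, 7, 7, 9, 47, 3), Decimal("5.01")),
    (datetime(2014, 7, 7, 10, 15, 0), Decimal("42.00")),
]

summary = hourly_rollup(raw)
for hour, stats in summary.items():
    print(hour.isoformat(), stats["count"], stats["total"])
```

However many raw transactions go in, two years of hourly buckets comes to at most 2 x 365 x 24 = 17,520 summary rows for each metric tracked. A real index would likely carry more dimensions (store, product, channel) and grow accordingly.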

Again, however, he finds few businesses are slimming their data stockpiles.

"It's only the really smart companies that have started to pare that down," said Atkinson. "They may have the hoarder's closet somewhere, but they've also made a new [data] store that's much more efficient, that tries to answer smart questions and not just grab hold of everything."

About the Author

Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.
