The best insights come from data you've just collected, not the musty bits you've saved for years, argues SumAll's CEO.
The save-everything mantra chanted by many big data proponents is a waste of money and resources, as organizations will glean few, if any, actionable insights from massive stockpiles of archived data. Rather, the real big data payback comes from near-real-time analysis of information as it's collected.
So says Dane Atkinson, CEO of SumAll, a three-year-old data analytics startup based in New York City. SumAll's platform takes in data from a variety of sources, including social media, email, and e-commerce, and allows companies to analyze the information right away.
Given the real-time nature of SumAll's business, perhaps it's no surprise that its CEO would preach the benefits of fast-acting data analysis. Then again, Atkinson isn't the only big data player to point out the shortcomings of information hoarding.
In a phone interview with InformationWeek, Atkinson noted that companies often warehouse big data at great expense, even when they're not sure what insights they'll gain from it. And if they don't know which questions to ask of it today, they're hopeful the astute queries will come months, or even years, down the road.
"That's the theory. That's exactly it: 'We don't know smart questions to ask now, so we're going to keep it all so that we can ask them later,'" said Atkinson, distilling the common rationale behind data hoarding, which he considers an expensive process with a dubious ROI.
"It costs a lot of money," he said. "It costs us millions of dollars a year to store our customers' data."
But despite the expense, the popular trend is to save it all.
"It's not even a question. Every company, every Internet company, tries to store all the data they possibly can," he claimed. "They believe in this theory of big data, that it'll someday be valuable."
Atkinson wasn't suggesting that companies stop storing data altogether, but rather that they do so more efficiently and with a clearly defined strategy.
"We would highly discourage storing it in a fashion that's sort of the definition of big data -- where you have it in some SSD environment on Amazon, or on a rack of servers that are costing you a fortune -- because you're not getting value out of it," he said. "You're not asking questions because it's just too big."
Still, companies often become data hoarders.
"They're living in the hoarder's environment," said Atkinson. "They're taking in all the data and putting it into a repository."
One alternative: Rather than saving every bit, companies should determine the questions they want to ask of their data, and then store the indexes they really need, a move that "will take your data down by many factors," he claimed.
Take a retail business, for instance.
"You may not need to have every second's worth of transactional history over the last four years, but it's probably pretty handy to know how [each] day went," said Atkinson. "So rolling up those 60 minutes into an hour metric [will] give your team really good guidance on the trends and patterns they want to see."
Rather than storing, say, the 2 billion transactions your business did in the past two years, save an index that tallies the hourly transaction totals during that period, he added.
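The rollup Atkinson describes can be sketched in a few lines. This is a minimal illustration, not SumAll's actual implementation: it assumes transactions arrive as (timestamp, amount) pairs and collapses them into hourly buckets, which is what lets gigabytes stand in for terabytes.

```python
from collections import defaultdict
from datetime import datetime

def rollup_hourly(transactions):
    """Collapse raw (timestamp, amount) records into hourly totals."""
    totals = defaultdict(lambda: {"count": 0, "revenue": 0.0})
    for ts, amount in transactions:
        # Truncate the timestamp to the top of the hour to form the bucket key.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        totals[bucket]["count"] += 1
        totals[bucket]["revenue"] += amount
    return dict(totals)

# Hypothetical sample data: three raw transactions become two hourly rows.
transactions = [
    (datetime(2014, 6, 2, 9, 15), 19.99),
    (datetime(2014, 6, 2, 9, 47), 5.00),
    (datetime(2014, 6, 2, 10, 3), 42.50),
]
hourly = rollup_hourly(transactions)
```

Once the hourly index is saved, the raw second-by-second records can be discarded or moved to cheap cold storage; trend questions ("how did each day go?") are answered from the small table.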
This approach can greatly reduce the size of your data hoard -- "gigabytes versus terabytes," claimed Atkinson.
Again, however, he finds few businesses are slimming their data stockpiles.
"It's only the really smart companies that have started to pare that down," said Atkinson. "They may have the hoarder's closet somewhere, but they've also made a new [data] store that's much more efficient, that tries to answer smart questions and not just grab hold of everything."
Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.