Why Big Is Bad When It Comes To Data
Calling it "big data" doesn't do it justice. Gushing data would be far more accurate.
Take "big data." It's the catchphrase du jour. You hear it everywhere. The tech media, including InformationWeek, covers it thoroughly. Database and analytics vendors are glomming on to it for the cachet it gives their marketing efforts. I had to grin when SAS CEO Jim Goodnight, a wizened figure if ever there was one, properly scoffed in a recent interview with InformationWeek's Doug Henschen that "we're talking about big data now because everyone got tired of talking about the cloud."
There's nothing inherently wrong with being a new thing. Trouble is, the term is just so imprecise. What does it say when the generally authoritative Wikipedia describes "big data" right off the bat as a "loosely defined term"?
Lately, my meanderings have taken me into a number of encounters with some of the best minds dealing with "big data," including researchers from Intel and MIT, hands-on executive managers at companies such as LinkedIn, eBay, and Adobe, and entrepreneurs such as Ash Damle of MEDgle.
And the more I bump into the topic of "big data," the more concerned I've become about the term itself. Reason: It falls far short of describing not only the phenomenon but also its applications, opportunities, and ramifications--for IT, for business, and for the way we live and work.
Unless you're a computer science PhD or a database professional, it's easy to take the term literally. And among those who do, don't forget, are the corporate execs and line-of-business managers with whom even those of you in the know must deal. To them, "big" is just about the amount. It's not difficult to imagine the petabytes piling up out there, given the contrail of information everyone exhausts as they move across the various fixed and mobile networks.
Of course, volume is the most immediate issue many of you face in dealing with your data. At a big data panel held at Google's Silicon Valley HQ last week, the participants addressed at length the costs of warehousing, and along two dimensions--size and duration. It's not just how much data you want to process and store but for how long. And they also raised the issue of diminishing returns. When do the costs of keeping and sifting over time outweigh practical benefit?
Data isn't static, like the standing waters of a reservoir. It's increasingly dynamic, generated and collected in real time. Even transactional data is being captured at both ends--and at every point in between. Ergo, data gushes.
And it gushes from an expanding number of sources, including all the sensors monitoring more and more of what we do. One of my favorite examples comes from Eve M. Schooler, an Intel R&D principal, who pointed out that public utility smart meters in many municipalities now report energy usage every 15 minutes--frequently enough to discern any number of behavioral patterns, such as when you're home (or not), alone or with others. And that's just one silent stream.
Those three "V's"--volume, velocity, and variety--go back a ways, of course. Analyst Doug Laney, then at META Group (later acquired by Gartner), used them to describe big data as far back as 2001. But it doesn't hurt to revive aged, but still valid, thinking if only because "big data," properly defined, will present a multitude of challenges to many of you reading this, and soon enough.
One is analytics. MIT's Michael Stonebraker contends that the "simple analytics" that data warehouses can apply to relational databases just aren't up to the complex, covariant calculations required to tap the probabilities and predictive insights--the real gold--within the gushing streams of unstructured data spouting up everywhere.
To make his point about the limitations of relational databases and the simple analytics applied to them, Stonebraker cites one pharmaceutical company trying to mine the data being captured by its 8,000 research scientists, each with an individual electronic Web notebook. Imagine the payoff, he suggests, of finding a groundbreaking new drug through probabilistic connections between one researcher's work and another's, seemingly so far apart in distance and subject matter. While there are informatics systems capable of integrating 10 data sources, there are none that can choke down thousands, Stonebraker said. "Hell will freeze over before you get it done," he added.
Finally, describing data as nothing more than "big" makes it seem too benign. There ought to be an adjective that at least hints at the grave implications for privacy lurking ahead as companies, governments, and heaven-knows-who-else become ever more adept at collecting, storing, processing, analyzing, and visualizing data.
As you might expect, Stonebraker foresees momentous economic and social value. At the same time, he also sees the dark side. "Privacy is going to be a huge issue," he said. "And it's largely going to be a political issue, too."
So if "big" doesn't cut it as an appropriate modifier, then what does? Maybe we should start an effort to call it something else, before the term is popularized beyond any redemption whatsoever.
"Gush data" anyone?
Patrick Houston is the co-founder of MediaArchitechs. He is a former SVP for a new media startup, a GM at Yahoo, and editor-in-chief at CNET.com. He can be reached at email@example.com.