Commentary
1/4/2012
09:13 AM

Big Data: Why All The Fuss?

Big Data can be mined even if you're not a big company, thanks to cheap computing power and storage and new cloud-based technologies.

Everything old is new again. I've heard this adage many times but only recently realized what it really means: The old thing isn't new again; it's just technically feasible now.

Take Big Data, which isn't a new concept. Companies have been dealing with large datasets for a long time, almost since the invention of the computer. What has changed is the way we deal with that data.

It started with the rise of cheap computing and storage. It became possible to store vast amounts of data and then split up workloads to have each computer deal with a small piece of work that it can complete in a short time. We can store and measure more as a result of technological advancements, and people are collecting more data than ever before. This data screams to be mined for valuable insights.

The next jump in the Big Data concept came with cloud computing. We now can summon hundreds or thousands of virtual computers with just a single command, process a large workload, and then return those computers to the resource pool.

So what is Big Data anyway, and why is it so popular now? The definition I like: Big Data refers to the tools and processes of managing and utilizing large datasets. Since I'm a big proponent of public clouds, my definition also includes the use of virtualization, but yours doesn't have to. Now that we're finally able to affordably store and process petabytes of data, Big Data is something accessible to everyone, not just the biggest companies.

Google is a master of Big Data. The company's director of research, Peter Norvig, points out that a simple algorithm applied to a large dataset can be much more useful than a complex algorithm applied to a small dataset. One example is how Google can predict flu outbreaks before the Centers for Disease Control does. Using search terms, it can find when many people in one city are searching for flu remedies or terms related to flu symptoms. Using its vast computing resources, Google can scan all the search terms in near real time, simply count the occurrences of the indicator words, and then store the location of those searches. It can then process that list of locations, looking for "hotspots" where the indicator words come up often, and predict with some certainty that a flu outbreak is imminent.
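
To make the "simple algorithm, large dataset" idea concrete, here is a toy Python sketch of the counting approach. It is not Google's actual method: the indicator terms, the sample searches, and the threshold are all made-up assumptions for illustration.

```python
from collections import Counter

# Hypothetical indicator terms; a real system would use a much larger, tuned list.
FLU_TERMS = {"flu", "fever", "cough", "chills"}

def count_flu_searches(searches):
    """Count searches per city that mention at least one indicator term."""
    counts = Counter()
    for city, query in searches:
        words = set(query.lower().split())
        if words & FLU_TERMS:
            counts[city] += 1
    return counts

def find_hotspots(counts, threshold):
    """Return the cities whose indicator-search count meets the threshold."""
    return sorted(city for city, n in counts.items() if n >= threshold)

# Made-up sample data: (city, search query) pairs.
searches = [
    ("Austin", "flu remedies"),
    ("Austin", "fever and chills"),
    ("Austin", "best tacos"),
    ("Boston", "cough medicine"),
    ("Boston", "red sox score"),
]

counts = count_flu_searches(searches)
hotspots = find_hotspots(counts, threshold=2)
print(hotspots)
```

The logic is nothing more than counting and comparing against a threshold. The value comes from running it over an enormous number of searches.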

But you don't have to be Google to use Big Data. Does your company collect information about its customers? Could it be collecting more, especially with the expansion of mobile computing via smartphones? With cheap storage and computers, your company can collect the kind of customer data Amazon.com did to change the face of retail.

With any discussion of Big Data, NoSQL isn't far behind. Fundamentally, it's the idea of storing data in a non-relational way, sans schema. One big advantage of NoSQL is that it lets us consume multiple data sources, so we aren't dependent on the data conforming to one particular standard.
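
As a rough illustration of what "sans schema" means, the Python sketch below loads records from two sources that don't share a format. The sources, field names, and values are hypothetical, and a plain list stands in for a real document store.

```python
import json

# Two hypothetical sources describing customers with different fields.
web_signup = '{"name": "Ann", "email": "ann@example.com"}'
mobile_event = '{"name": "Bob", "device": "phone", "location": {"city": "Austin"}}'

# A plain list stands in for a document store: no table is defined up front.
store = []
for raw in (web_signup, mobile_event):
    store.append(json.loads(raw))

# Queries must tolerate missing fields, because no schema guarantees them.
with_email = [doc["name"] for doc in store if "email" in doc]
cities = [doc.get("location", {}).get("city") for doc in store]
print(with_email, cities)
```

Neither record had to conform to the other's layout to be stored. The trade-off is that every query has to cope with fields that may not be there.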

If you've been working with large datasets (or even small ones), you're familiar with structured data that is accessed with some form of SQL. SQL is really good at answering specific questions, like, "How much do we pay all the people in our company who have the name Jason and a spouse named Laura?" To answer that question, we need to have a table with columns of employee first names, salaries, and spouse names, and that table has to be defined before we put any data into it.
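
Here is a minimal sketch of that question using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for the example. Note that the table has to be created before any row goes in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema comes first: columns are fixed before any data is stored.
conn.execute(
    "CREATE TABLE employees (first_name TEXT, salary INTEGER, spouse_name TEXT)"
)

# Made-up sample rows.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Jason", 50000, "Laura"),
        ("Jason", 60000, "Maria"),
        ("Erin", 70000, "Laura"),
        ("Jason", 40000, "Laura"),
    ],
)

# One query answers the whole question.
total = conn.execute(
    "SELECT SUM(salary) FROM employees WHERE first_name = ? AND spouse_name = ?",
    ("Jason", "Laura"),
).fetchone()[0]
print(total)
conn.close()
```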

That table definition is called a schema, and not having one means that we have a lot more flexibility with our data storage but need to do more work to answer that same question. It would take at least two passes over all the data--once to find all the people with the name Jason, then a second pass to find the ones with spouses named Laura. However, you could split this work across many computers to speed it up, which is what MapReduce is all about. You could also decide later on to start tracking employee spouses' names, and you wouldn't have to alter any tables to do it.
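
The same question against schema-less records might look like the sketch below. The records and field names are made up, and a small thread pool stands in for the "many computers"; a real cluster would distribute the chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up schema-less records: the spouse field was added later,
# so some records simply don't have it.
records = [
    {"first_name": "Jason", "salary": 50000, "spouse": "Laura"},
    {"first_name": "Jason", "salary": 60000},
    {"first_name": "Erin", "salary": 70000, "spouse": "Laura"},
    {"first_name": "Jason", "salary": 40000, "spouse": "Laura"},
]

def partial_total(chunk):
    """Total the salaries in one chunk using the two passes described above."""
    # First pass: everyone named Jason.
    jasons = [r for r in chunk if r.get("first_name") == "Jason"]
    # Second pass: of those, the ones with a spouse named Laura.
    matches = [r for r in jasons if r.get("spouse") == "Laura"]
    return sum(r.get("salary", 0) for r in matches)

def split(items, n):
    """Split items into n roughly equal chunks."""
    return [items[i::n] for i in range(n)]

# Each worker handles one chunk; the partial totals are combined at the end.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_total, split(records, 2)))

total = sum(partials)
print(total)
```

Adding the spouse field required no table change. Older records just lack it, and the lookup with `get` skips them.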

With the rising popularity of virtualized computing, especially public clouds, comes an explosion in Big Data technologies such as MapReduce, Hadoop, and Hive. MapReduce, a concept popularized by Google and implemented in the open source Hadoop, is a technique that allows for the division of workloads across multiple servers. Hive then sits on top of Hadoop, bringing back some of the SQL functionality that many are accustomed to.
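
For a feel of the MapReduce pattern itself, here is a single-machine Python sketch of the classic word-count example. It is not Hadoop's API: the function names and sample documents are made up. In a real system, the map and reduce calls would run on different servers.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Made-up input: each string stands in for a file held on a different server.
documents = ["big data big insights", "data in the cloud"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be divided without the pieces needing to talk to one another.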

Ultimately, Big Data encompasses a wide variety of concepts and technologies, but in the end the specific technology doesn't really matter. What matters is what you do with your data. Stay tuned for my next Big Data column, where I'll cover the technologies in more depth.

Jeremy Edberg is the lead reliability engineer for Netflix, a former operations manager for Reddit, and a conference chair for UBM TechWeb's Cloud Connect, Feb. 13 to 16 in Santa Clara, Calif. You can find Jeremy at jedberg.net or follow him on Twitter @jedberg.

Comments
thegreeneman,
User Rank: Apprentice
1/6/2012 | 5:28:18 PM
re: Big Data: Why All The Fuss?
Great to see some reality checking; love the comment, "it's just technically feasible now". One of the things I think needs to happen next is that non-Google companies, companies with only a few developers and limited budgets, can get access to the power of unlocked Big Data. Most of today's Big Data solutions for the common company are only addressing one end of the spectrum ... scale. Old SQL did a great job of managing information model complexity, but sucked at scale. Emerging Big Data solutions (e.g. Key:Value, Columnar, Document stores) have thus far only really addressed scale for the common company. However, the reality is that the world is getting more complex, not simpler, and truly long-term viable Big Data solutions must come to terms with this fact. They need to address scale, but they also need to address information model complexity in a way much closer to what was possible with old SQL. Old is new, old is well, just now technically feasible ..

Objects, Graphs, Networks .... dealing with complexity at scale.... the future is inevitable... can people see it? If you are interested in dealing with the truth check out:
Versant - managing information complexity at scale. http://www.versant.com
Trendwise,
User Rank: Apprentice
1/5/2012 | 10:32:51 AM
re: Big Data: Why All The Fuss?
Nice article, Jeremy. Another simple use case for Big Data is social media analytics and sentiment analysis, which involve large amounts of unstructured data.
Looking forward to your next articles.
-Trendwise Analytics
TealeafFan,
User Rank: Apprentice
1/5/2012 | 12:22:16 AM
re: Big Data: Why All The Fuss?
That is what drew me to Tealeaf, the company I presently work for. It stores entire sessions of data to more easily find issues and opportunities.