Everything old is new again. I've heard this adage many times but only recently realized what it really means: The old thing isn't new again; it's just technically feasible now.
Take Big Data, which isn't a new concept. Companies have been dealing with large datasets for a long time, almost since the invention of the computer. What has changed is the way we deal with that data.
It started with the rise of cheap computing and storage. It became possible to store vast amounts of data and then split workloads so that each computer handles a small piece it can complete quickly. We can store and measure more as a result of technological advancements, and people are collecting more data than ever before. This data screams to be mined for valuable insights.
The next jump in the Big Data concept came with cloud computing. We now can summon hundreds or thousands of virtual computers with just a single command, process a large workload, and then return those computers to the resource pool.
So what is Big Data anyway, and why is it so popular now? The definition I like: Big Data refers to the tools and processes of managing and utilizing large datasets. Since I'm a big proponent of public clouds, my definition also includes the use of virtualization, but yours doesn't have to. Now that we're finally able to affordably store and process petabytes of data, Big Data is something accessible to everyone, not just the biggest companies.
Google is a master of Big Data. The company's director of research, Peter Norvig, points out that a simple algorithm applied to a large dataset can be much more useful than a complex algorithm on a small dataset. One example is how Google can predict flu outbreaks before the Centers for Disease Control and Prevention does. Using search terms, it can detect when many people in one city are searching for flu remedies or terms related to flu symptoms. Using its vast computing resources, Google can scan all the search terms in near real time, simply count the occurrences of the indicator words, and then store the location of those searches. It can then process that list of locations, looking for "hotspots" where those searches occur frequently, and predict with some certainty that a flu outbreak is imminent.
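The core of that approach really is just counting. Here's a minimal sketch of the idea in Python; the indicator terms, search log, and threshold are all invented for illustration and have nothing to do with Google's actual system:

```python
from collections import Counter

# Hypothetical flu-indicator terms (an assumption for this sketch).
FLU_TERMS = {"flu remedy", "fever", "chills", "flu symptoms"}

# Hypothetical search log: (city, search term) pairs.
searches = [
    ("boston", "flu remedy"),
    ("boston", "fever"),
    ("boston", "chills"),
    ("miami", "beach hotels"),
    ("miami", "fever"),
]

def flu_hotspots(searches, threshold=3):
    """Count indicator-term searches per city; flag cities at or over threshold."""
    counts = Counter(city for city, term in searches if term in FLU_TERMS)
    return [city for city, n in counts.items() if n >= threshold]

print(flu_hotspots(searches))  # ['boston']
```

The simple algorithm (count, group by location, compare to a threshold) does all the work; the value comes from running it over an enormous stream of searches.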
But you don't have to be Google to use Big Data. Does your company collect information about its customers? Could it be collecting more, especially with the expansion of mobile computing via smartphones? With cheap storage and computers, your company can collect the kind of customer data Amazon.com did to change the face of retail.
With any discussion of Big Data, NoSQL isn't far behind. Fundamentally, it's the idea of storing data in a non-relational way, sans schema. One big advantage of NoSQL is that it lets us consume multiple data sources, so we aren't dependent on the data conforming to one particular standard.
If you've been working with large datasets (or even small ones), you're familiar with structured data that is accessed with some form of SQL. SQL is really good at answering specific questions, like, "How much do we pay all the people in our company who are named Jason and have a spouse named Laura?" To answer that question, we need to have a table with columns for employee first names, salaries, and spouse names, and that table has to be defined before we put any data into it.
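That question maps directly onto a query. Here's a minimal sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration:

```python
import sqlite3

# The schema must be defined before any data goes in.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        first_name  TEXT,
        salary      INTEGER,
        spouse_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Jason", 90000, "Laura"),
     ("Jason", 85000, "Megan"),
     ("Laura", 95000, "Jason")],
)

# "How much do we pay everyone named Jason with a spouse named Laura?"
total, = conn.execute(
    "SELECT SUM(salary) FROM employees "
    "WHERE first_name = 'Jason' AND spouse_name = 'Laura'"
).fetchone()
print(total)  # 90000
```

One declarative statement answers the question, precisely because the database knew the shape of the data in advance.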
That table definition is called a schema, and not having one means that we have a lot more flexibility with our data storage but need to do more work to answer that same question. It would take at least two passes over all the data: one to find everyone named Jason, then a second to find which of them have spouses named Laura. However, you could split this work across many computers to speed it up, which is what MapReduce is all about. You could also decide later on to start tracking employee spouses' names, and you wouldn't have to alter any tables to do it.
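Without a schema, each record is just a free-form bag of fields, and answering the question means scanning them. A minimal sketch, with illustrative records:

```python
# Schemaless records: free-form dicts, no table definition required.
# A record is free to omit fields entirely (illustrative data).
records = [
    {"first_name": "Jason", "salary": 90000, "spouse_name": "Laura"},
    {"first_name": "Jason", "salary": 85000},          # spouse never tracked
    {"first_name": "Laura", "salary": 95000, "spouse_name": "Jason"},
]

def total_pay(records):
    # No index, no schema: scan every record and check each criterion.
    # .get() tolerates records that never stored a spouse at all.
    return sum(
        r["salary"]
        for r in records
        if r.get("first_name") == "Jason" and r.get("spouse_name") == "Laura"
    )

print(total_pay(records))  # 90000
```

Starting to track spouses' names later means just adding the field to new records; no table alteration, and old records keep working. The cost is that every query is a full scan, which is exactly the work MapReduce spreads across machines.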
With the rising popularity of virtualized computing, especially public clouds, comes an explosion in Big Data technologies such as MapReduce, Hadoop, and Hive. MapReduce, a concept popularized by Google and implemented in the open source Hadoop, is a technique that allows for the division of workloads across multiple servers. Hive then sits on top of Hadoop, bringing back some of the SQL functionality that many are accustomed to.
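To make the division of work concrete, here's a toy single-process MapReduce in the classic word-count shape: map each document to key/value pairs, shuffle the pairs by key, then reduce each key's values. This is a sketch of the programming model only; in Hadoop the map and reduce phases actually run in parallel across many servers:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "big clusters"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

Because each document is mapped independently and each key is reduced independently, both phases parallelize naturally, which is why the model scales to thousands of machines.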
Ultimately, Big Data encompasses a wide variety of concepts and technologies, but in the end it doesn't really matter. What matters is what you do with your data. Stay tuned for my next Big Data column, where I'll cover the technologies in more depth.
Jeremy Edberg is the lead reliability engineer for Netflix, a former operations manager for Reddit, and a conference chair for UBM TechWeb's Cloud Connect, Feb. 13 to 16 in Santa Clara, Calif. You can find Jeremy at jedberg.net or follow him on Twitter @jedberg.