Stop trying to make last year's solution fit next year's problems. Develop multiple plans for your big data.
Rarely does one catchphrase ignite an entire market, but in the world of IT, big data has done it. The term has a thousand definitions, though, which renders it effectively meaningless, so allow me to bring the hype back to earth.
Simply put, big data applies to any dataset that breaks the boundaries and conventional capabilities of IT. Big data's defining characteristic could be scale--capacity is the easiest thing to get your brain around. Sheer volume of content can blow up your data center's existing capabilities. It could also be the number of transactions you need to process.
Big data is really a cause. A new approach to dealing with it is the effect, which is what's important. The effect will change everything.
History And Confusion
Big data is often equated with analytics, and while analytics is one use case, it's by no means the only one. However, it's a good place to start to understand how we got here. In short, we start with the concept of "My Data"--the data generated by a single person, for example.
My Enterprise Strategy Group colleague Julie Lockner created a Structured Data Reference Model that tracks the life of My Data, which makes it easier to understand how something small ends up so very large. In this model, data that's created lives within a transaction processing system. While this model may vary from organization to organization and application to application, generally speaking, four data lifecycles are initiated when data is created: transaction processing, reporting and analytics, backup or disaster recovery, and application testing and development.
Data, created once, is replicated to these four functions, just within the domain of the transaction processing system. The first level of analytics exists within the transaction processing system itself (completed transactions, failures, etc.). The data is then prepared, processed, transformed, and replicated outside of the transaction processing system to be housed inside a data warehousing system, where one may perform analytics on a group of My Data records, looking for sales based on geographies, for example. That data warehouse also will require data protection and disaster recovery functions, and other copies will be required for test/development.
Then, all the My Data objects are transformed, processed, and replicated to a "Big Analytics" system, where they're pored over for shopping cart dropout rates and other cause/effect scenarios. Again, copies of copies are used for test/development, backup, and DR.
Wow. It doesn't take long to see how one little transaction record can grow 100-fold. Sooner or later, that growth will break the capabilities of conventional IT.
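To make the copy-of-copies arithmetic concrete, here is a minimal Python sketch. It assumes three tiers (transaction processing, data warehouse, Big Analytics), each holding a primary copy plus backup, DR, and test/development copies. The tier layout follows the model described above, but every copy count is a made-up number for illustration, and the names `TIERS`, `copies_per_tier`, and `total_copies` are hypothetical.

```python
# Hypothetical copy counts per tier. None of these numbers come from the
# reference model; they only illustrate how copies multiply.
TIERS = {
    "transaction processing": {"primary": 1, "backup": 3, "disaster recovery": 1, "test/dev": 4},
    "data warehouse": {"primary": 1, "backup": 3, "disaster recovery": 1, "test/dev": 4},
    "big analytics": {"primary": 1, "backup": 3, "disaster recovery": 1, "test/dev": 4},
}


def copies_per_tier(tiers):
    """Return the number of copies of one record held in each tier."""
    return {name: sum(copies.values()) for name, copies in tiers.items()}


def total_copies(tiers):
    """Return the total number of copies of one record across all tiers."""
    return sum(copies_per_tier(tiers).values())


if __name__ == "__main__":
    for tier, count in copies_per_tier(TIERS).items():
        print(f"{tier}: {count} copies")
    print(f"total copies of one record: {total_copies(TIERS)}")
```

With these invented counts, one record already exists 27 times, and that is before any size increase from transformation or longer backup retention. Plug in your own numbers to see how quickly the multiple climbs.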
To steal a line from Julie: "More than just data volume, smart big data strategies also consider the velocity, variety, and complexity of information." Data sources aren't just simple transaction processing systems. They come from social media, they include dozens of content types (video, audio, etc.), and they come from every known device on the planet.
No wonder the industry is so fired up about big data. The advances create new opportunities for your company to sell more stuff--and for companies to sell more to you. They also create new opportunities to screw up.
So what breaks when you cross the tipping point of big data? You first find that all the fundamentals break. For instance, you can't process all the data any longer, so you start to process only sub-groups, and then you hope the groups you chose are fair representations of the overall data pool (they aren't). You're using traditional structured database systems that no longer work because your datasets are 1,000 times bigger than the DBMS was ever designed to support. You can't inject your data into your analytics (or any other) system fast enough. You can't grow your storage infrastructure fast enough. You can't back up the data fast enough, so the concept of recovery is completely shot.
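The sub-group problem is easy to see with a toy example. The Python sketch below compares the average order value of one conveniently chosen sub-group with the average over the whole pool. The regions and order values are invented for illustration, and `ORDERS`, `overall_mean`, and `subgroup_mean` are hypothetical names, not part of any real system.

```python
from statistics import mean

# Hypothetical order values by region, invented purely for illustration.
ORDERS = {
    "north": [20, 22, 21, 19],
    "south": [60, 58, 62, 60],
    "west": [100, 105, 95, 100],
}


def overall_mean(orders):
    """Average order value across every record in the pool."""
    return mean(value for values in orders.values() for value in values)


def subgroup_mean(orders, region):
    """Average order value when only one region's records are processed."""
    return mean(orders[region])


if __name__ == "__main__":
    print(f"north only: {subgroup_mean(ORDERS, 'north'):.2f}")
    print(f"all data:   {overall_mean(ORDERS):.2f}")
```

In this made-up pool, processing only the "north" records gives an average of 20.5, while the full pool averages about 60.2. A sub-group picked for convenience can miss the real picture by a wide margin.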
So what do you do? You stop trying to make last year's solution fit next year's problems.
Tons of technologies are being developed to address these issues across the board. Most are simply Band-Aids. Others, like Hadoop, are more radical and will fundamentally change the way you do things (storage, in this case). Most need more time to develop into legitimate enterprise alternatives, but they're on the way.
Meanwhile, the next time someone asks, "What's your plan for big data?" respond, "Which one?" You're going to need a few.
Steve Duplessie is the founder and senior analyst at the Enterprise Strategy Group, a leading independent authority on enterprise storage, analytics, and a range of other business technology interests.