Big Data Debate: Will Hadoop Become Dominant Platform?
Will Hadoop become the hub from which most data management activities will either integrate or originate? Two big data experts square off.
Apache's Hadoop framework has become synonymous with the big data movement, but is it destined to become the dominant data management platform for us all?
In our first big data debate in December, we asked, "Is the end near for data warehousing?" with Hadoop forever changing our notion of the enterprise data warehouse. But with the emergence of several high-profile SQL-on-Hadoop options and projects in recent weeks and months -- including Cloudera's Impala, the Apache Drill effort led by MapR, IBM BigSQL, Hortonworks' Stinger project, and EMC's Pivotal Distribution -- the question gets bigger, and it's time to revisit this topic.
The supposition is not that Hadoop will totally replace relational databases or other tools. But is Hadoop destined to become the high-scale centerpiece or hub from which most data management activities and analyses will either integrate or originate?
Read what our two big data experts have to say, then weigh in with your opinion in the comments section below.
For The Motion
David Menninger, Head of Business Development and Strategy, Pivotal Initiative, EMC
Hadoop Will Lead The Way
We used to ask whether we could afford to store information. Today we ask whether we can afford to throw it away. This is not a technology argument; it is economic. Hadoop is part of the reason.
Hadoop has fundamentally changed the economics of storing and analyzing information. As recently as five years ago, a scalable relational database cost $100K per terabyte for a perpetual software license, plus $20K per year for maintenance and support. Today you can store, manage and analyze the same amount of information with a $1,200/year subscription. This difference in economics has attracted a lot of attention and will make Hadoop the centerpiece from which most large-scale data management activities and analyses will either integrate or originate.
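The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration using the figures quoted above; it assumes the $1,200/year subscription covers the same one terabyte, and it ignores hardware, staffing and migration costs, none of which are given here. The function names are made up for the example.

```python
# Per-terabyte figures quoted in the paragraph above (USD).
RELATIONAL_LICENSE = 100_000     # one-time perpetual software license
RELATIONAL_MAINTENANCE = 20_000  # maintenance and support, per year
HADOOP_SUBSCRIPTION = 1_200      # subscription, per year (assumed to cover 1 TB)


def relational_cost(years: int) -> int:
    """Cumulative cost of the relational option after `years` years."""
    return RELATIONAL_LICENSE + RELATIONAL_MAINTENANCE * years


def hadoop_cost(years: int) -> int:
    """Cumulative cost of the Hadoop subscription after `years` years."""
    return HADOOP_SUBSCRIPTION * years


for years in (1, 3, 5):
    r, h = relational_cost(years), hadoop_cost(years)
    print(f"{years} yr: relational ${r:,} vs Hadoop ${h:,} (about {r / h:.0f}x)")
```

Under these assumptions the gap is roughly 100x in the first year and narrows to about 33x after five years, because the one-time license is spread over more years.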
Initially, relational databases and Hadoop varied greatly in their capabilities. Hadoop required lots of hand coding and had a limited ecosystem of supporting tools, but with enough effort you could do amazing things over large volumes of data. And research that I conducted showed that people are doing new and amazing things with Hadoop, not just replacing their existing technologies, relational or otherwise.
Now, with more robust SQL capabilities being married to the Hadoop infrastructure and bringing the entire SQL-based ecosystem to the world of Hadoop, the market has expanded by one or two orders of magnitude. No longer is Hadoop just the domain of specialists. Sure, there are certain things that you will be able to do only by getting down and dirty with Hadoop, but now the results of that work can be shared more easily with so many others.
Let's look at the counterargument. How could the relational database market respond? If you agree that economics are driving much of this shift, then the obvious response is lower prices. In recent years, prices have declined based on competition within the scalable relational database market as well as competition with Hadoop. However, even with lower price points, scaling relational technology to the same scale as Hadoop is a stretch. Alternatively, relational databases can embrace Hadoop as a way to compete at scale. We've seen this happen. Most scalable relational databases have a way to connect with or leverage Hadoop. The inevitable result is the collision of these technologies that we are experiencing right now.
Relational technologies operating on their own face another obstacle. The genie is out of the bottle with respect to unstructured/multi-structured/loosely-structured data. I won't debate here which is the best term, but with Hadoop we now analyze more and different information than we can with relational databases.
So, with the ability to analyze more information in more ways using the same tools used throughout the organization, what's to prevent Hadoop from becoming the default data platform for projects with any scale?
David Menninger is the head of business development and strategy for The Pivotal Initiative (Greenplum, a division of EMC). David also served as VP and research director for Ventana Research covering big data, analytics, and information management.
Against The Motion
James Kobielus, IBM Big Data Evangelist
SQL Is The Centerpiece
Hadoop's footprint will continue to grow for some time in the big data arena, especially as the core open-source technologies evolve and enterprises invest more heavily in the technology. However, Hadoop will be neither the dominant platform nor the architectural centerpiece of most enterprise big data deployments. But that also applies to any other big data platforms, current or emerging, that you might name.
Why is this? Because hybrid big data architectures -- which combine two or more technologies in specialized roles -- are becoming dominant. Hybrid deployments -- combining Hadoop with in-memory, key-value stores, massively parallel processing (MPP) RDBMSs, stream computing, graph databases, and other platforms -- are already widespread in real-world deployments. Each platform has specific advantages in terms of the "3 Vs" of scalability, and also in terms of supporting specific categories of data sources, analytical workloads, deployment roles and downstream applications. And each type of technology is, like Hadoop, experiencing rapid adoption in real-world applications.
Many users have come to realize that no one type of big data platform is optimal for all requirements. Frequently, we see customers adopt Hadoop for specific roles -- especially exploratory data-science sandboxes and unstructured data staging -- while relying on in-memory for front-end BI query acceleration, stream computing for continuous data ingest and MPP RDBMS for data warehousing and master data management. As more NoSQL, NewSQL and other big data approaches come to market, each will also gain acceptance only through its ability to specialize in roles for which more established platforms are suboptimal.
The architectural centerpiece of this new big data landscape must be a standard query-virtualization or abstraction layer that supports transparent SQL access to any and all back-end platforms. SQL will continue to be the lingua franca for all analytics and transactional database applications. Consequently, big data solution providers absolutely must allow SQL developers to transparently tap into the full range of big data platforms, current and future, without modifying their code.
Unfortunately, the big data industry still lacks a consensus query-virtualization approach. Today's big data developers must wrangle with a plethora of SQL-like languages for big data access, query and manipulation, including HiveQL, CassandraQL, JAQL, SQOOP, SPARQL, Shark, and DrQL. Many, but not all, of these are associated with a specific type of big data platform -- most often, it's with Hadoop. I'm including IBM's "BigSQL" (currently in Technology Preview) in this list of industry initiatives.
The fact that we refer to many of these initiatives as "SQL-on-Hadoop" is a danger sign. We as an industry need to go one step beyond. The big data arena threatens to split into diverse, siloed platforms unless we bring SQL fully into it all as a lingua franca.
Siloed query languages and frameworks threaten to ramp up the cost, complexity, incompatibility, risk and unmanageability of multiplatform big data environments. The situation is likely to grow more fragmented as big data innovation intensifies and hybrid deployments predominate -- unless we put standardization on the industry front burner.
Having a unified query virtualization layer would enable more flexible, heterogeneous topologies of Hadoop and non-Hadoop platforms in a common architecture. It would also reduce the need for big data adopters to write custom integration code to access it all behind a common development abstraction.
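As a rough illustration of the idea, not a description of any vendor's product, a query-virtualization layer can be thought of as a router that accepts plain SQL and forwards it to whichever back end holds the data. In the sketch below the `QueryRouter` class and `make_backend` helper are hypothetical, and two in-memory SQLite databases stand in for a Hadoop store and an MPP warehouse. A real layer would also have to parse the SQL, translate dialects and join across platforms.

```python
import sqlite3


class QueryRouter:
    """Hypothetical abstraction layer: maps table names to back-end connections."""

    def __init__(self):
        self._backends = {}

    def register(self, table, connection):
        # Record which back end serves which table.
        self._backends[table] = connection

    def query(self, table, sql, params=()):
        # The caller writes ordinary SQL; the router picks the platform.
        connection = self._backends[table]
        return connection.execute(sql, params).fetchall()


def make_backend(table, rows):
    """Stand-in for a real platform: an in-memory SQLite database."""
    connection = sqlite3.connect(":memory:")
    connection.execute(f"CREATE TABLE {table} (id INTEGER, value TEXT)")
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return connection


router = QueryRouter()
router.register("clicks", make_backend("clicks", [(1, "home"), (2, "cart")]))
router.register("orders", make_backend("orders", [(1, "book")]))

clicks = router.query("clicks", "SELECT value FROM clicks ORDER BY id")
orders = router.query("orders", "SELECT value FROM orders WHERE id = ?", (1,))
print(clicks, orders)
```

The point of the sketch is that the calling code never changes when a table moves to a different platform; only the registration does.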
James Kobielus is IBM's big data evangelist. He is an industry veteran who spearheads IBM's thought leadership activities in big data, data science, enterprise data warehousing, advanced analytics, Hadoop, business intelligence, data management and next best action technologies.