Apache's Hadoop upgrade, now in general availability, goes beyond MapReduce and promises better options for SQL-style querying, graph analysis and stream processing.
Standard advice in public speaking is to tell the audience what you're going to tell them, then tell them and, finally, tell them what you told them.
So it is in open source software release cycles. The Apache Software Foundation and its self-appointed surrogate, Hortonworks, have been telling us what's coming in Hadoop 2.0. They recently told us again that the release was imminent. On Wednesday came the announcement that it's finally here, meaning generally available for download.
Do you need to hear, yet again, what's new in Hadoop 2.0? The big new piece is YARN (a mangled acronym for Yet Another Resource Negotiator), a cluster resource management layer that will enable Hadoop to handle much more than batch-oriented MapReduce jobs. With YARN you can assign cluster capacity to particular workloads in order to meet their service level demands.
Thus, in Hadoop 2.0, big, resource-sucking MapReduce jobs can co-exist with HBase workloads and Hive queries, for example. In the Hadoop 1.0 world, companies often deployed separate clusters for HBase and MapReduce work in order to avoid system contention.
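To make the capacity-sharing idea concrete, here is a minimal sketch in Python of how one cluster's memory might be divided among workloads by percentage share. It is a toy model only, not YARN's API: the function name `split_capacity`, the queue names, the 1,000 GB cluster size and the 50/30/20 split are all assumptions made for illustration. In a real cluster, shares like these would be set in the scheduler's configuration rather than in application code.

```python
from typing import Dict


def split_capacity(total_memory_gb: int, shares: Dict[str, float]) -> Dict[str, float]:
    """Divide cluster memory among workload queues by percentage share.

    Hypothetical helper for illustration; it is not part of Hadoop or YARN.
    """
    total = sum(shares.values())
    # Percentage shares must account for the whole cluster.
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"queue shares must sum to 100, got {total}")
    return {queue: total_memory_gb * pct / 100.0 for queue, pct in shares.items()}


# Assumed numbers: a 1,000 GB cluster shared by three workloads.
allocation = split_capacity(1000, {"mapreduce": 50, "hbase": 30, "hive": 20})
for queue, gb in allocation.items():
    print(f"{queue}: {gb:.0f} GB guaranteed")
```

The point of the sketch is that each workload gets a guaranteed slice of a single shared cluster, which is what removes the need for the separate HBase and MapReduce clusters common in Hadoop 1.0 deployments.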
YARN also promises better support for a host of emerging Hadoop workloads, including Storm, a stream-processing platform developed by Twitter; Apache Giraph, the open source graph analysis engine; and Spark, a tool for in-memory analytics on top of Hadoop. Storm recently became an official Apache open source project, and Hortonworks announced Monday that it will make a preview Storm integration available in Q4, to be followed by a general release in the Hortonworks Data Platform in Q1 of 2014.
"One of the most common use cases that we see emerging from our customers is ... stream processing in Hadoop," wrote Bob Page, VP of products at Hortonworks, in a blog post this week. "Early adopters are using stream processing to analyze some of the most common new types of data such as sensor and machine data in real time."
Hadoop 2.0 will also better support SQL-on-Hadoop options, though each Hadoop distributor seems to have its own prescription for how best to handle that big demand. Cloudera's answer is Impala. Hortonworks is sticking with Hive, which is supported by new elements of Hadoop 2.0 including the Tez execution engine. IBM has BigSQL, and MapR has proposed Apache Drill. Pivotal is promoting its HAWQ technology, which is derived in part from its Greenplum database.
At InformationWeek, we've recently observed that relational database vendors including Oracle and Teradata have been dwelling on the shortcomings of Hadoop, but mostly it's a look backwards at Hadoop 1.0. To tell you again what's coming in Hadoop 2.0, think beyond batch MapReduce toward new, resource-managed workloads including SQL-like querying, HBase NoSQL database operations, Giraph graph analysis and Storm real-time processing.