News
10/16/2013
12:54 PM

Hadoop 2.0 Goes GA: New Workloads Await

Apache's Hadoop upgrade, now in general availability, goes beyond MapReduce and promises better options for SQL-style querying, graph analysis and stream processing.

Standard advice in public speaking is to tell the audience what you're going to tell them, then tell them and, finally, tell them what you told them.

So it is in open source software release cycles. The Apache Software Foundation and its self-appointed surrogate, Hortonworks, have been telling us what's coming in Hadoop 2.0. They recently told us again that the release was imminent. On Wednesday came the announcement that it's finally here, meaning generally available for download.

Do you need to hear, yet again, what's new in Hadoop 2.0? The big new piece is YARN (a mangled acronym for Yet Another Resource Negotiator), a cluster resource management layer that lets Hadoop handle much more than batch-oriented MapReduce jobs. With YARN, you can allocate cluster capacity to particular workloads in order to meet their service-level demands.

Thus, in Hadoop 2.0, big, resource-hungry MapReduce jobs can coexist with HBase workloads and Hive queries, for example. In the Hadoop 1.0 world, companies often deployed separate clusters for HBase and MapReduce work in order to avoid system contention.
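To make that concrete, YARN's Capacity Scheduler lets administrators carve the cluster into queues with guaranteed shares. The sketch below is illustrative only: the queue names "batch" and "interactive" and the percentage splits are hypothetical, not part of any shipping default.

```xml
<!-- capacity-scheduler.xml (sketch): split a YARN cluster between
     batch MapReduce jobs and interactive HBase/Hive work.
     Queue names and percentages here are hypothetical examples. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>batch,interactive</value>
  </property>
  <property>
    <!-- 60% of cluster resources guaranteed to batch MapReduce jobs -->
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- 40% guaranteed to interactive workloads such as Hive queries -->
    <name>yarn.scheduler.capacity.root.interactive.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- batch may borrow idle capacity, but never beyond 80% -->
    <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
    <value>80</value>
  </property>
</configuration>
```

A job submitted to the "interactive" queue is guaranteed its share even while a large MapReduce job saturates the "batch" queue, which is what makes the single mixed-workload cluster viable.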

[ Want more on Teradata's alternative for many Hadoop workloads? Read Teradata Brings Graph Analysis To SQL. ]

YARN also promises better support for a host of emerging Hadoop workloads, including Storm, a stream-processing platform developed at Twitter; Apache Giraph, the open source graph analysis engine; and Spark, a tool for in-memory analytics on top of Hadoop. Storm recently became an official Apache open source project, and Hortonworks announced Monday that it will make a preview Storm integration available in Q4, followed by a general release in the Hortonworks Data Platform in Q1 2014.

"One of the most common use cases that we see emerging from our customers is ... stream processing in Hadoop," wrote Bob Page, VP of products at Hortonworks, in a blog this week. "Early adopters are using stream processing to analyze some of the most common new types of data such as sensor and machine data in real time."

Hadoop 2.0 will also better support SQL-on-Hadoop options, though each Hadoop distributor seems to have its own prescription for how best to handle that big demand. Cloudera's answer is Impala. Hortonworks is sticking with Hive, supported by new elements of Hadoop 2.0 including the Tez execution engine. IBM has Big SQL, and MapR has proposed Apache Drill. Pivotal is promoting its HAWQ technology, which is derived in part from its Greenplum database.
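For a sense of what SQL-style querying on Hadoop looks like in practice, here is a minimal Hive sketch. The `hive.execution.engine` setting is how Hive switches from MapReduce to Tez in Tez-enabled builds; the table and column names below are hypothetical examples, not from any real deployment.

```sql
-- Switch Hive from MapReduce to the Tez execution engine
-- (assumes a Hive build with Tez support installed).
SET hive.execution.engine=tez;

-- Hypothetical clickstream table: aggregate page views per day.
SELECT to_date(view_time) AS view_day,
       COUNT(*)           AS views
FROM   page_views
GROUP BY to_date(view_time)
ORDER BY view_day;
```

The point of Tez here is latency: the same HiveQL runs unchanged, but the query plan executes as a single DAG rather than a chain of MapReduce jobs.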

At InformationWeek, we've recently observed that relational database vendors including Oracle and Teradata have been dwelling on the shortcomings of Hadoop, but that's mostly a look backward at Hadoop 1.0. To tell you once more what's coming in Hadoop 2.0: think beyond batch MapReduce toward new, resource-managed workloads including SQL-like querying, HBase NoSQL database operations, Giraph graph analysis and Storm real-time stream processing.

IT leaders must know the trade-offs they face to get NoSQL's scalability, flexibility and cost savings. Also in the When NoSQL Makes Sense issue of InformationWeek: Oregon's experience building an Obamacare exchange. (Free registration required.)
