Hadoop 2.0 will move beyond batch processing to support interactive, online and streaming applications. But don't let warnings about YARN tie you up in knots.
You could use YARN to allocate resources to Impala, for example, but then Enterprise RTQ would manage the concurrent queries running inside Impala. "If these queries went directly to YARN alone -- and you can look at how it works with Hive and Stinger, for example -- you no longer have two systems managing resources on the cluster and you don't have to reinvent everything from multi-tenancy to security and all of those things," Murthy says.
This is a Hortonworks take on Hadoop, as that company is sticking with Hive and working on project Stinger as a way to drive faster query performance (45X in recent tests, according to Murthy). Impala, HAWQ and other SQL-on-Hadoop projects are offering alternatives to Hive that don't rely on MapReduce running behind the scenes (as is the case with Hive). With last month's release of Cloudera Impala, company CEO Mike Olson said of Hive, "We don't believe that it's going to be possible to drive down latencies and improve performance sufficiently via that platform."
This is a side issue that doesn't take anything away from support of Hadoop 2.0 or the value of YARN. We've asked Cloudera, MapR and others for their positions on YARN, and thus far the statements of support are universal.
Cloudera is not only contributing to YARN development and shipping a preview version in its software distribution, according to Charles Zedlewski, VP of products; it's also "undertaking some developments in Impala to better take advantage of YARN. With the way Impala was designed there is no overlapping resource management or security."
MapR is "working with the community to enhance YARN and make it more valuable," said MapR VP of Marketing Jack Norris. "For example, we are the primary contributors to Apache Drill, which is the YARN-based SQL-on-Hadoop solution." (Drill is in development and will be MapR's answer to Cloudera Impala and Stinger-improved Hive.)
Customers will inevitably decide the outcome if there is any debate. If most organizations are determined to control Hadoop resources using YARN, it will be easy enough for commercial tools to be rearchitected to defer to YARN resource management controls. If proprietary tools offer some measure of added value, we just might see overlapping administrative controls. It won't be the first time.
The main point on the pending release of Hadoop and YARN is that the platform is maturing, Murthy maintains. It's a step that will help move the platform beyond the early Web-company implementations into the diverse demands of the enterprise market.
"YARN makes it easy for a lot of applications to come into the Hadoop ecosystem and it gives you a significantly better return on your Hadoop cluster," he says. "You can manage applications one way, operate one way, monitor one way and drive down the cost of running your entire data architecture."