Hadoop According To Hortonworks: An Insider's View
Shaun Connolly, Hortonworks VP of corporate strategy, dishes on Hadoop 2.0, competition with Cloudera and the threat of big data commoditization.
IW: How does Hadoop fit into the overall enterprise data architecture?
Connolly: Companies will get started with one or two targeted applications around, say, clickstreams or sensors as the new form of data that they weren't able to use before. But I'd say a good 70% of firms are doing their initial deployments with an eye toward Hadoop as a shared data service or data lake. They want to get to that, but it tends to be a multi-year process. Within three to six months they might be able to develop one or two analytic applications.
IW: What are the other 30% doing?
Connolly: They're staying focused on very targeted analytic applications. It might be an online marketing company where they have two or three use cases. That's what they use Hadoop for and they don't need a massive data lake where they might try to drive productivity or efficiency out of other parts of the business.
IW: Speaking of analytics, there's been a big focus on SQL on Hadoop capabilities this year. Cloudera introduced Impala on this front, but Hortonworks' choice is to improve Hive with the Stinger project. What's the latest update?
Connolly: Hive up through release 0.10 was built to run on classic, batch-oriented MapReduce. Work on Hive 0.11 has optimized Hive's use of MapReduce and how it goes about generating MapReduce jobs. It has also begun to find where Hive can be smarter by, for example, spotting queries that are right for MapReduce and those that require interactive response times. Tez [a component of Hadoop 2.0] is designed to speed up the performance of Hive in those scenarios. With Stinger, the goal is to improve Hive's performance 100X, and we're probably about halfway there.
IW: What is Tez all about? That's a very new component.
Connolly: Tez is an Apache incubator project, and its job is to alleviate the MapReduce notion of having intermediate files stored and replicated three times in HDFS. That introduces a lot of latency. Tez lets you do a lot of MapReduce in memory, so you don't have to persist data in the file system at every step. Tez is better suited for SQL-type access, and Hive, in Stinger Phase 2, is taking advantage of it -- as are Apache Pig and some of the other projects in the ecosystem.
Ultimately Tez will be an always-on service so you don't have to iteratively boot up Java, run your job and shut it down. Tez will always be there to field requests and handle a lot of processing in memory, and it will be far faster than classic MapReduce. That underpins our Stinger initiative. Hive does petabyte-scale data processing extremely well. We're confident we can enable Hive to take advantage of Tez so it can handle the human-interactive use cases very well and still have just one system using a standard set of SQL.
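To make the intermediate-file point concrete, here is a minimal Python sketch. It is not Tez or MapReduce code. It is a toy word count, with hypothetical function names, that chains two stages in two ways: once by writing the first stage's output to storage and reading it back, and once by handing the result over in memory. Both paths return the same answer; the persisted path simply pays for an extra write and read, which is the kind of overhead Connolly describes (a real cluster would also replicate that file).

```python
import json
import os
import tempfile
from collections import Counter


def map_stage(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_stage(pairs):
    """Sum the counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def top_stage(totals, n):
    """Second job: keep the n most frequent words (ties broken alphabetically)."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def run_with_persisted_intermediate(lines, n):
    """Chain two jobs through storage: job 1 writes a file, job 2 reads it back."""
    totals = reduce_stage(map_stage(lines))
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(totals, handle)      # intermediate result hits the disk
        with open(path) as handle:
            reloaded = json.load(handle)   # next job starts by reading it again
    finally:
        os.remove(path)
    return top_stage(reloaded, n)


def run_in_memory(lines, n):
    """Chain the same stages without persisting anything in between."""
    return top_stage(reduce_stage(map_stage(lines)), n)


if __name__ == "__main__":
    clicks = ["click click view", "view click buy"]
    print(run_with_persisted_intermediate(clicks, 2))
    print(run_in_memory(clicks, 2))
```

The longer the chain of stages behind a single SQL query, the more often the persisted variant repeats that write-and-read step, which is why removing it matters most for interactive queries.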
IW: The complaints about Hive have been about the lack of SQL-like functionality as well as slow query performance. Will Stinger address both of these flaws?
Connolly: On performance, we're talking about a handful of seconds -- human-interactive response times. It's not going to be like an OLTP relational database that delivers millisecond response. That's not the goal of Stinger. It's designed for humans interacting with large sets of data in Hadoop and being able to do iterative, ad-hoc querying.
IW: Hadoop 2.0 introduces a lot of change. How quickly will it be embraced and what's going to drive adoption?
Connolly: The Hadoop 2.0 initiatives have been underway for more than three years. If you take Yahoo, all of its clusters have been moved to Hadoop 2.0 with YARN. It has been running in production for six to nine months. They were doing rollouts and testing a year before that. We're pretty confident that it's ready and bulletproof from an enterprise perspective. The level of quality will be what we're used to shipping.
When you look at new analytic applications and large, data-lake scenarios, many customers want to do MapReduce batch processing and interactive querying while using HBase as an online database. They want to be able to run those all on a cluster that has legitimate resource management [with YARN]. Right now, HBase clusters are typically separated from other Hadoop workloads primarily because you don't want an online data store competing for resources. That's not a fit with the Hadoop goal of putting all of your data in one spot.
So [YARN] resource management will be a big driver of Hadoop 2.0 adoption, but companies will also double their headroom on their Hadoop clusters. Yahoo, for example, won't have to expand its clusters for a while because it freed up a bunch of room and can run twice the number of jobs that it previously handled.
IW: How was that much capacity unlocked?
Connolly: Yahoo now runs twice the number of jobs on the same servers because YARN is able to schedule the resource requirements of those jobs much more efficiently than the old, classic MapReduce. For medium-size clusters on up, that will definitely be a pull to deploy Hadoop 2.0. Customers can also expect Hive to take advantage of Tez and YARN for better interactive querying, and HBase and other workloads will be able to co-exist in a single cluster. That will be another draw to Hadoop 2.0.
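Connolly does not spell out the mechanism, so the following Python sketch is only an illustration under stated assumptions. One commonly cited difference is that classic MapReduce gave each node a fixed number of map slots and reduce slots, while YARN hands out general-purpose containers. The function names and slot counts below are hypothetical, and the model counts tasks rather than memory or CPU, which is what a real YARN scheduler works with. It is not meant to reproduce Yahoo's figures.

```python
def running_with_fixed_slots(pending_map, pending_reduce, map_slots, reduce_slots):
    """Tasks that can run at once when each task type is confined to its own slots."""
    return min(pending_map, map_slots) + min(pending_reduce, reduce_slots)


def running_with_shared_containers(pending_map, pending_reduce, containers):
    """Tasks that can run at once when any task may use any free container."""
    return min(pending_map + pending_reduce, containers)


if __name__ == "__main__":
    # Assumed toy node: 8 units of capacity, split 4/4 in the fixed-slot model.
    # The workload is in its map phase: 8 map tasks waiting, no reduce tasks yet.
    print(running_with_fixed_slots(8, 0, map_slots=4, reduce_slots=4))
    print(running_with_shared_containers(8, 0, containers=8))
```

In this toy case the reduce slots sit idle during the map phase under the fixed split, while the shared pool keeps every unit of capacity busy. When the workload happens to match the split exactly, the two models run the same number of tasks.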