Hadoop According To Hortonworks: An Insider's View - InformationWeek

Shaun Connolly, Hortonworks VP of corporate strategy, dishes on Hadoop 2.0, competition with Cloudera and the threat of big data commoditization.

IW: How does Hadoop fit into the overall enterprise data architecture?

Connolly: Companies will get started with one or two targeted applications around, say, clickstreams or sensors as the new form of data that they weren't able to use before. But I'd say a good 70% of firms are doing their initial deployments with an eye toward Hadoop as a shared data service or data lake. They want to get to that, but it tends to be a multi-year process. Within three to six months they might be able to develop one or two analytic applications.

IW: What are the other 30% doing?

Connolly: They're staying focused on very targeted analytic applications. It might be an online marketing company where they have two or three use cases. That's what they use Hadoop for and they don't need a massive data lake where they might try to drive productivity or efficiency out of other parts of the business.

IW: Speaking of analytics, there's been a big focus on SQL on Hadoop capabilities this year. Cloudera introduced Impala on this front, but Hortonworks' choice is to improve Hive with the Stinger project. What's the latest update?

Connolly: Hive up through release 0.10 was built to run on classic, batch-oriented MapReduce. Work on Hive 0.11 has optimized Hive's use of MapReduce and how it goes about generating MapReduce jobs. It has also begun to find where Hive can be smarter by, for example, spotting queries that are right for MapReduce and those that require interactive response times. Tez [a component of Hadoop 2.0] is designed to speed up the performance of Hive in those scenarios. With Stinger, the goal is to improve Hive's performance 100X, and we're probably about halfway there.

IW: What is Tez all about? That's a very new component.

Connolly: Tez is an Apache incubator project, and its job is to alleviate the MapReduce notion of having intermediate files stored and replicated three times in HDFS. That introduces a lot of latency. Tez lets you do a lot of MapReduce in memory, so you don't have to persist data in the file system at every step. Tez is better suited for SQL-type access, and Hive, in Stinger Phase 2, is taking advantage of it -- as are Apache Pig and some of the other projects in the ecosystem.

Ultimately Tez will be an always-on service so you don't have to iteratively boot up Java, run your job and shut it down. Tez will always be there to field requests and handle a lot of processing in memory, and it will be far faster than classic MapReduce. That underpins our Stinger initiative. Hive does petabyte-scale data processing extremely well. We're confident we can enable Hive to take advantage of Tez so it can handle the human-interactive use cases very well and still have just one system using a standard set of SQL.
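
To put rough numbers on the latency point about intermediate files, here is a toy Python calculation, not a model of Tez itself. It assumes a chain of jobs in which each job's output is written to HDFS at the replication factor of three that Connolly mentions, and compares that with a pipeline that keeps intermediate results in memory and persists only the final result. The function name `hdfs_gb_written` and the stage sizes are hypothetical, chosen only for illustration; shuffle spills, compression and failure recovery are ignored.

```python
def hdfs_gb_written(stage_outputs_gb, replication=3, persist_intermediate=True):
    """Return the GB physically written to HDFS for a chain of processing stages.

    stage_outputs_gb lists the output size of each stage in order; the last
    entry is the final result, which is always persisted.
    """
    *intermediate, final = stage_outputs_gb
    written = final * replication
    if persist_intermediate:
        # Chained classic MapReduce jobs: every intermediate output also
        # lands in HDFS and is replicated.
        written += sum(intermediate) * replication
    return written


# Hypothetical three-stage query: 100 GB, then 40 GB, then a 10 GB result.
stages = [100, 40, 10]

chained_jobs = hdfs_gb_written(stages, persist_intermediate=True)
in_memory_pipeline = hdfs_gb_written(stages, persist_intermediate=False)

print(chained_jobs, in_memory_pipeline)
```

Under these made-up sizes, the chained jobs write 450 GB to HDFS while the in-memory pipeline writes 30 GB, which is the kind of avoided I/O the answer above describes.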

IW: The complaints about Hive have been about the lack of SQL-like functionality as well as slow query performance. Will Stinger address both of these flaws?

Connolly: On performance, we're talking about a handful of seconds -- human-interactive response times. It's not going to be like an OLTP relational database that delivers millisecond response. That's not the goal of Stinger. It's designed for humans interacting with large sets of data in Hadoop and being able to do iterative, ad-hoc querying.

IW: Hadoop 2.0 introduces a lot of change. How quickly will it be embraced and what's going to drive adoption?

Connolly: The Hadoop 2.0 initiatives have been underway for more than three years. If you take Yahoo, all of its clusters have been moved to Hadoop 2.0 with YARN. It has been running in production for six to nine months. They were doing rollouts and testing a year before that. We're pretty confident that it's ready and bulletproof from an enterprise perspective. The level of quality will be what we're used to shipping.

When you look at new analytic applications and large, data-lake scenarios, many customers want to do MapReduce batch processing and interactive querying while using HBase as an online database. They want to be able to run those all on a cluster that has legitimate resource management [with YARN]. Right now, HBase clusters are typically separated from other Hadoop workloads primarily because you don't want an online data store competing for resources. That's not a fit with the Hadoop goal of putting all of your data in one spot.

So [YARN] resource management will be a big driver of Hadoop 2.0 adoption, but companies will also double their headroom on their Hadoop clusters. Yahoo, for example, won't have to expand its clusters for a while because it freed up a bunch of room and can run twice the number of jobs that it previously handled.

IW: How was that much capacity unlocked?

Connolly: Yahoo now runs twice the number of jobs on the same servers because YARN is able to schedule the resource requirements of those jobs much more efficiently than the old, classic MapReduce. For medium-size clusters on up, that will definitely be a pull to deploy Hadoop 2.0. Customers can also expect Hive to take advantage of Tez and YARN for better interactive querying, and HBase and other workloads will be able to coexist in a single cluster. That will be another draw to Hadoop 2.0.
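
One commonly cited reason for this kind of gain, which Connolly does not spell out here, is that classic MapReduce divides each node into a fixed number of map slots and reduce slots, while YARN hands out generic containers that any task can use. The Python sketch below is a toy illustration under that assumption. The function names and the node sizes are hypothetical, and the result is not a reconstruction of Yahoo's figures.

```python
def classic_slots_running(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Tasks running at once when map and reduce slots are fixed and not interchangeable."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def containers_running(pending_maps, pending_reduces, total_containers):
    """Tasks running at once when any free container can host either kind of task."""
    return min(pending_maps + pending_reduces, total_containers)


# Hypothetical node: 4 map slots plus 4 reduce slots, versus 8 generic containers.
# Workload: a map-heavy phase with 8 map tasks waiting and no reduce tasks yet.
slot_based = classic_slots_running(8, 0, map_slots=4, reduce_slots=4)
container_based = containers_running(8, 0, total_containers=8)

print(slot_based, container_based)
```

In this map-heavy moment the slot-based node runs 4 tasks while its reduce slots sit idle, and the container-based node runs all 8. Real clusters see a mix of phases, so the actual gain depends on the workload.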

User Rank: Apprentice
8/21/2013 | 7:14:03 PM
re: Hadoop According To Hortonworks: An Insider's View
Besides Stinger and Impala, another Apache project that provides interactive-speed SQL-on-Hadoop capabilities is Apache Drill. Drawing inspiration from Dremel and other projects, it has been making great progress through collaboration among multiple companies. It is close to an alpha release and has a very flexible architecture.

User Rank: Strategist
8/21/2013 | 1:34:23 AM
re: Hadoop According To Hortonworks: An Insider's View
Ari Zilka led a massive, modernizing redeployment of the Java backend of Wal-Mart's web site (Mark Towfiq led the user interaction side), then generalized the technology for Terracotta's in-memory data management system. Terracotta applied a big speedup to the way Java applications could handle data. He's the right successor to Hortonworks' founding CTO, Eric Baldeschwieler.
D. Henschen
User Rank: Author
8/20/2013 | 2:06:58 PM
re: Hadoop According To Hortonworks: An Insider's View
This was a long interview and I had to cut some good stuff. I pressed the point about Hortonworks' strategy in this exchange:

IW: Doesn't Hortonworks' strategy kind of put it in the background -- a services company that takes a back seat to partners like Microsoft and

Connolly: It doesn't put us in the background. If you look at the Teradata Unified Data Architecture, our box is one of three that they advertise to the market as part of a best-of-breed big data architecture. We're a technology platform provider. We're not a database provider. We're not going to focus only on SQL; that's just one of the workloads that the platform can and should support. So when you say, "are we going to run out of gas on things that can be done around Hadoop," we think the party has just started. If you look at the number of committers that we have, there are 21 at Hortonworks versus seven or eight at Cloudera. That's just the Apache Hadoop project. We have approaching 80 direct committers across Hadoop, Hive, Pig and other projects, and we do the open source project releases in many of those. That's why we're valuable to our partners.