Hadoop According To Hortonworks: An Insider's View

Shaun Connolly, Hortonworks VP of corporate strategy, dishes on Hadoop 2.0, competition with Cloudera and the threat of big data commoditization.

Doug Henschen, Executive Editor, Enterprise Apps

August 19, 2013

Shaun Connolly, Hortonworks

Hortonworks recently marked its second year in business and its first year of offering a distribution of Hadoop open-source software and related commercial support services. Next up, within a matter of weeks, will be the next release of the Hortonworks Data Platform, incorporating next-generation Hadoop 2.0.

YARN (short for Yet Another Resource Negotiator) is a crucial new open source component that will improve Hadoop performance and move it beyond the confines of batch MapReduce processing. Work is also underway, as part of the Hortonworks-supported Stinger project, to deliver a higher-performance, more SQL-compatible version of Hive. SQL-on-Hadoop capabilities are just one area in which Hortonworks is in a pitched competitive battle with Cloudera. While Hortonworks waits to ship foundation-approved open-source software, Cloudera has added Impala and other components to Hadoop that are best administered through its commercial management software.

Can Hortonworks innovate and build the value of its company, or is the company's "100% open source" strategy vulnerable to commoditization as the Hadoop platform matures? Shaun Connolly, Hortonworks' VP of Corporate Strategy, spoke with InformationWeek about a range of topics including CTO Eric Baldeschwieler's recent departure, prospects for Hadoop 2.0, acquisition rumors and the company's long-range plans.

InformationWeek: Hortonworks presents itself as the company that promotes Hadoop as an "enterprise viable platform," but isn't that a foregone conclusion at this point?

Shaun Connolly: I think that mission has a lot more legs. If I draw a corollary to how the Linux market played out, Linux started out with some very targeted workloads. Hadoop, in its first generation, was clearly batch-oriented MapReduce processing. As Linux matured you got secure Linux and virtualized Linux, and the platform took on a lot more mission-critical workloads. That's what we're seeing with Hadoop. With YARN, other types of workloads will be able to snap into Hadoop and be coordinated on the same platform.

IW: On the personnel front, Hortonworks' co-founder, CTO and former CEO Eric Baldeschwieler recently left the company. Have you selected a new CTO?

Connolly: Our new CTO is Ari Zilka, who was the CTO and one of the founders of Terracotta, which is an in-memory data-management technology that's now a part of Software AG. Ari was previously at Walmart, where he deployed massive-scale data systems. Ari has been at Hortonworks for almost a year and a half, and he has worn mostly a field-CTO-type hat as chief product officer. He has also helped customers leverage Hadoop and integrate it with lower-latency architectures.

IW: How big a hole in technical depth did Eric Baldeschwieler's departure leave?

Connolly: We've effectively grown 10X since our founding in terms of number of employees. We started with about 24 engineers from Yahoo, including Eric. Eric has chosen to move on and do other things, and that was a personal choice. The rest of the core team from Yahoo -- Arun Murthy, Owen O'Malley, Alan Gates, Sanjay Radia, Suresh Srinivas, Mahadev Konar and others -- are all active in their projects and are Hortonworks employees.

We've grown from those Yahoo roots and have a good many engineers from Oracle, IBM and MySQL. We also have folks from Microsoft and SAP as well as Amazon and Google. We have a good mix from Web-scale companies as well as enterprise software developers. Greg Pavlik [VP of engineering] in particular has been able to attract a bunch of folks because he spent many years at Oracle.

IW: What's your customer base like these days and what are the primary use cases you're seeing for Hadoop?

Connolly: We ended the last quarter with more than 120 customers. We're actively working with customers across Web, retail, media, telco and healthcare. We see a fair amount in the Web and retail spaces, including brick-and-mortar retailers. They'll typically get started with analytic applications taking advantage of new data sources, like clickstreams, social sentiment and devices. With clickstreams and social they're after the classic 360-degree customer view.

Hadoop offers a more economical solution where these customers can store way more data. In the case of healthcare, they're after a 360-degree view of the patient, and we're seeing electronic medical records applications as well as uses in pharma around manufacturing analytics.

IW: Are these net-new applications, or were these things firms were trying to do but without much success with relational databases?

Connolly: They were trying to do it, in many cases, but they had a sprawl of systems and they could never tag one system as the place where they could pull all of that information together. They tended to have an incomplete view, and they were always focused on looking only at, say, 30 to 60 days of data when they had to put it in a traditional data warehouse. The cost structures of data warehouses are anywhere from 10 to 100 times higher than what they can drive per terabyte on a Hadoop cluster. Now they can store multiple years of data, not just a month or two.

IW: How does Hadoop fit into the overall enterprise data architecture?

Connolly: Companies will get started with one or two targeted applications around, say, clickstreams or sensors as the new form of data that they weren't able to use before. But I'd say a good 70% of firms are doing their initial deployments with an eye toward Hadoop as a shared data service or data lake. They want to get to that, but it tends to be a multi-year process. Within three to six months they might be able to develop one or two analytic applications.

IW: What are the other 30% doing?

Connolly: They're staying focused on very targeted analytic applications. It might be an online marketing company where they have two or three use cases. That's what they use Hadoop for and they don't need a massive data lake where they might try to drive productivity or efficiency out of other parts of the business.

IW: Speaking of analytics, there's been a big focus on SQL on Hadoop capabilities this year. Cloudera introduced Impala on this front, but Hortonworks' choice is to improve Hive with the Stinger project. What's the latest update?

Connolly: Hive up through release 0.10 was built to run on classic, batch-oriented MapReduce. Work on Hive 0.11 has optimized Hive's use of MapReduce and how it goes about generating MapReduce jobs. It has also begun to find where Hive can be smarter by, for example, spotting queries that are right for MapReduce and those that require interactive response times. Tez [a component of Hadoop 2.0] is designed to speed up the performance of Hive in those scenarios. With Stinger, the goal is to improve Hive's performance 100X, and we're probably about halfway there.

IW: What is Tez all about? That's a very new component.

Connolly: Tez is an Apache incubator project, and its job is to alleviate the MapReduce notion of having intermediate files stored and replicated three times in HDFS. That introduces a lot of latency. Tez lets you do a lot of MapReduce in memory, so you don't have to persist data in the file system at every step. Tez is better suited for SQL-type access, and Hive, in Stinger Phase 2, is taking advantage of it -- as are Apache Pig and some of the other projects in the ecosystem.

Ultimately Tez will be an always-on service so you don't have to iteratively boot up Java, run your job and shut it down. Tez will always be there to field requests and handle a lot of processing in memory, and it will be far faster than classic MapReduce. That underpins our Stinger initiative. Hive does petabyte-scale data processing extremely well. We're confident we can enable Hive to take advantage of Tez so it can handle the human-interactive use cases very well and still have just one system using a standard set of SQL.
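To make the contrast Connolly describes concrete, here is a minimal Python sketch of the general idea, not Tez's actual API. It runs the same two-stage job twice: once with the intermediate result written to a file and read back between stages (a stand-in for classic MapReduce persisting to HDFS), and once with the result handed directly to the next stage in memory. The function names and the sample data are hypothetical and chosen purely for illustration.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count occurrences of each word."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def run_with_disk_handoff(lines, workdir):
    """Classic-MapReduce-style chaining: stage 1 output is written to a file
    and read back before stage 2 starts (a local file stands in for HDFS)."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as handle:
        json.dump(tokenize(lines), handle)
    with open(path) as handle:
        words = json.load(handle)
    return count(words)


def run_in_memory(lines):
    """The in-memory idea: pass stage 1 output straight to stage 2."""
    return count(tokenize(lines))


if __name__ == "__main__":
    sample = ["Hive on Tez", "Hive on MapReduce"]  # hypothetical input
    with tempfile.TemporaryDirectory() as workdir:
        on_disk = run_with_disk_handoff(sample, workdir)
    in_memory = run_in_memory(sample)
    # Both paths compute the same answer; only the handoff between stages differs.
    print(on_disk == in_memory)
    print(in_memory)
```

Both paths return identical counts. The difference is the extra write and read between stages, which on a real cluster also involves replication and is the latency Connolly says Tez is meant to avoid.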

IW: The complaints about Hive have been about the lack of SQL-like functionality as well as slow query performance. Will Stinger address both of these flaws?

Connolly: On performance, we're talking about a handful of seconds -- human-interactive response times. It's not going to be like an OLTP relational database that delivers millisecond response. That's not the goal of Stinger. It's designed for humans interacting with large sets of data in Hadoop and being able to do iterative, ad-hoc querying.

IW: Hadoop 2.0 introduces a lot of change. How quickly will it be embraced and what's going to drive adoption?

Connolly: The Hadoop 2.0 initiatives have been underway for more than three years. If you take Yahoo, all of its clusters have been moved to Hadoop 2.0 with YARN. It has been running in production for six to nine months. They were doing rollouts and testing a year before that. We're pretty confident that it's ready and bulletproof from an enterprise perspective. The level of quality will be what we're used to shipping.

When you look at new analytic applications and large, data-lake scenarios, many customers want to do MapReduce batch processing and interactive querying while using HBase as an online database. They want to be able to run those all on a cluster that has legitimate resource management [with YARN]. Right now, HBase clusters are typically separated from other Hadoop workloads primarily because you don't want an online data store competing for resources. That's not a fit with the Hadoop goal of putting all of your data in one spot.

So [YARN] resource management will be a big driver of Hadoop 2.0 adoption, but companies will also double their headroom on their Hadoop clusters. Yahoo, for example, won't have to expand its clusters for a while because it freed up a bunch of room and can run twice the number of jobs that it previously handled.

IW: How was that much capacity unlocked?

Connolly: Yahoo now runs twice the number of jobs on the same servers because YARN is able to schedule the resource requirements of those jobs much more efficiently than the old, classic MapReduce. For medium-size clusters on up, that will definitely be a pull to deploy Hadoop 2.0. Customers can also expect Hive to take advantage of Tez and YARN for better interactive querying. HBase and other workloads will be able to co-exist in a single cluster, and that will be another draw to Hadoop 2.0.

IW: Cloudera clearly has a different strategy of building value on top of Hadoop. Can Hortonworks deliver enough value supporting foundation-sanctified software?

Connolly: We're part of the Apache Foundation and we're also part of the OpenStack Foundation. Those community models are important because they enable a lot more people to participate, and that drives the technology forward. Our fundamental belief is that getting more people involved doesn't slow things down; it results in higher-quality technology. That's why we do things like the Stinger initiative, rallying the Apache community around more performance and SQL compliance. We reached out to Facebook, Microsoft and others who have engineers actively coding to help the Stinger initiative achieve those goals. It isn't just Hortonworks.

What's our value? It's providing that type of leadership around things like Apache Knox for security and Apache Falcon for data lifecycle management. It isn't just Apache Hadoop, the project, much like it isn't just about the Linux kernel. It's about the variety of data services and operational services that come in around Hadoop.

If we were going after this market as if we wanted to be the next Oracle for big data, then, yes, we would probably be doing commercial extensions of Hadoop and trying to differentiate ourselves that way. We think Hadoop's opportunity is much broader than that. When you see the likes of Microsoft telling the world that the Hortonworks Data Platform for Windows on-premises is the Hadoop you want, and that it's compatible with the Azure HDInsight Service, that pulls the technology into the market at a much faster pace than if we did everything ourselves or created commercial extensions to Hadoop.

IW: Intel is talking about putting Hadoop software on chips. Does that hint that Hadoop is becoming a very standardized, commodity-type technology?

Connolly: That's really a great question about Intel's strategy. We've seen some of their engineers working on some of the Hadoop projects, like HBase and security, but how does that translate into an Intel offering that is credible, and how much can they push down onto the chip? I don't have that answer. Intel commits a lot to Linux, but an operating system is vastly different than a distributed data processing system. If Hadoop can take advantage of security features and other things that are built into chips, that's good. But it remains to be seen how that plays out.

IW: There were rumors that Intel was interested in buying Hortonworks, just as there were rumors that Microsoft took a run at the company last year. What is the end game for Hortonworks?

Connolly: Let me address that very specifically because there has also been speculation about our funding. Our series A round raised $23 million. Series B raised $25 million and the latest round, concluded in June, raised $50 million. We are running the company with the goal of becoming the dominant force in the next-generation data platform space. With the latest round of funding, our goal is to begin to get to cash-flow neutrality and profitability. That's how you prepare yourself to have the option to become a publicly traded company.

As far as acquisitions are concerned, rumors make for good Silicon Valley reality TV and discussion, but there have been no real offers for Hortonworks. People talk and stuff gets printed. The Microsoft acquisition assertion is patently false. It never happened. The Intel rumor came in advance of our latest funding round. Maybe somebody's signals got crossed and they made up that story.

At the end of the day, we're well capitalized and we're executing very well. Our goal is to drive Hortonworks as an independently run company. Acquisitions happen all the time, but we're not in the business to position ourselves to sell the company. There's a bigger opportunity than a quick-flip scenario.

About the Author(s)

Doug Henschen

Executive Editor, Enterprise Apps

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
