Cloudera Plans Data Hub Role For Hadoop

Cloudera prepares to make Hadoop a central hub and first destination for companies' data. SAS and Apache Spark add data-analysis options.

Doug Henschen, Executive Editor, Enterprise Apps

October 29, 2013

The nub of the big news from Cloudera at this week's Strata/Hadoop World event in New York is the beta release of Cloudera Enterprise 5, the latest version of the vendor's Hadoop platform.

The bigger picture is what Cloudera calls the Enterprise Data Hub, which is how the company says Hadoop is now being used by advanced practitioners and how it will be used by most customers within a few years.

"We've moved from batch storage [and processing] at scale to batch plus real-time analytics and real-time access to data with security, access control and audit logging capabilities," said Mike Olson, Cloudera's chief strategy officer, in an interview with InformationWeek. "That has a growing number of our customers deploying Hadoop at the center of their data centers as the first place data goes when it enters the enterprise, rather than at the side of the data center to solve a few, ancillary problems."

The vision sounds reminiscent of Olson's assertion earlier this year that "the center of gravity is shifting" away from data warehousing and toward Hadoop. But the Enterprise Data Hub is a more holistic view of how Cloudera's platform fits in an even broader view of data management.

"With an Enterprise Data Hub, information is cheap to store, and you can keep full-fidelity [unaggregated] data forever if you want to," Olson explained. "You can do your ETL and your data cleaning and preening on this new platform and deliver derived data sets to special-purpose data warehouses and document management systems for advanced processing there."

The idea is to keep the broadest and deepest swath of data on Hadoop -- or more correctly, Cloudera's commercially enhanced version of Hadoop -- and use Cloudera Impala SQL capabilities, Cloudera Search, and Cloudera Navigator access management and auditing to take over broad, high-scale workloads from traditional systems such as data integration, data warehouse and document management systems.

The Enterprise Data Hub vision will bring changes to Cloudera's product packaging and product pricing. The details have yet to be spelled out, but if the Enterprise Data Hub is to live up to its name, Olson said it has to meet enterprise-grade expectations for security, encryption, access control, logging, data lineage and more. That's where many of Cloudera's commercial options and components come in.

In the current packaging regime, Cloudera Enterprise 4 bundles open source CDH (Cloudera's distribution including Apache Hadoop) and Cloudera Manager (the vendor's commercial deployment, management and monitoring software) with commercial support. Anything beyond that is optional, including support and management capabilities for Apache HBase, Cloudera Impala, Cloudera Search, and advanced backup and disaster recovery.

When Cloudera Enterprise 5 becomes generally available early next year, Olson said we can expect an Enterprise Data Hub offering that will roll all of those a la carte options into one comprehensive package.

Extending The Hub

The Enterprise Data Hub vision also encompasses data-processing and data-analysis options from third-party vendors. On this front the company highlighted Cloudera Enterprise 5 support for deploying, managing and monitoring Hadoop-compatible products from partners including SAS, Revolution Analytics, Informatica, Syncsort and others. Support for SAS Scoring and SAS Visual Analytics on Cloudera's Hadoop platform is particularly significant given the vast pool of SAS-savvy data analytics professionals.

Reaching out to startups, Cloudera announced an Innovators program with Databricks, the company behind Apache Spark, as an inaugural member. Spark is an in-memory analytics framework that runs on Hadoop. Now in open source, the technology is being developed and commercially supported by Databricks, which was spun out of AMPLab at the University of California, Berkeley.

As part of the Innovators program, Cloudera announced direct support for Apache Spark within CDH, and Olson said the company will also provide technical resources and front-line support.

In other announcements tied to the Enterprise 5 beta release, Cloudera announced new capabilities including:

-- In-memory HDFS caching designed to boost Map/Reduce processing performance and Cloudera Impala query response times,
-- User-defined functions (UDFs) that let users store custom functions for use with Cloudera Impala and the open source MADlib statistical and analytic library for in-database analytics,
-- YARN and Cloudera Manager resource management tools designed to enable administrators to allocate resources by workload and workgroup,
-- Centralized data-auditing and data-lineage features to satisfy governance and compliance requirements,
-- New cloud-based Cloudera services from Savvis (a CenturyLink company), SoftLayer (an IBM company), T-Systems and Verizon Cloud. Cloudera services were already available through Amazon Web Services.

Commercial Vs. Open Source

Cloudera's biggest competitor, Hortonworks, will likely point to the many spots where Cloudera's Enterprise Data Hub vision leads to commercial software. Olson insisted that "everything at the core and at the platform level is open source." Where the company has proprietary software -- Cloudera Manager, Cloudera Navigator and some aspects of backup and disaster recovery -- Olson said the company is delivering unique value.

"If it's about manageability, deploying and operating the software at scale, or about getting added value from the data, we reserve the right to make that proprietary," he said.

In some cases, "core" options that are open source, like Cloudera Impala and Cloudera Search, would be almost impossible to run at scale without Cloudera Manager to provision, manage and monitor the workloads and Cloudera Navigator to provide access control and auditing.

Will these differences even matter to the many Hadoop users who are still tinkering with clusters and experimenting with Map/Reduce? Or has the platform really matured to the point where many companies are ready for the Enterprise Data Hub vision?

"We don't see [Enterprise Data Hub-style deployments] ubiquitously, but in most of our longstanding accounts and among all of our very largest customers, we absolutely see this happening," Olson said. "I expect most of our installed base to be deployed in this way within the next couple of years."

That's optimistic. As for the licensing, packaging and pricing differences, we have yet to hear the details. But it sounds like we can expect an all-you-can-eat Enterprise Data Hub licensing and support approach, plus a harder push to extend the community and tightly integrate Cloudera's flavor of Hadoop with third-party data-management and analytics offerings from SAS and similar companies.

About the Author(s)

Doug Henschen

Executive Editor, Enterprise Apps

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
