Cloudera Impala database for Hadoop joins Apache Spark in an accelerator program aimed at breeding big data applications.

Doug Henschen, Executive Editor, Enterprise Apps

November 10, 2014

Results from a Cloudera benchmark test comparing Impala single-user and multi-user query performance with other SQL-on-Hadoop options.

The Hadoop vendor Cloudera announced Monday that it is adding its Impala SQL-on-Hadoop database to a Cloudera Accelerator Program (CAP) designed to promote third-party development of big data applications.

CAP was introduced last month with a support initiative around Apache Spark analytic applications. Spark is an open-source in-memory platform that supports machine learning as well as streaming analysis, graph analysis, R-based analytics, and even SQL analysis. More than 20 third-party software vendors joined the program, with Alpine Data Labs, Platfora, RapidMiner, SAS, and Talend among the most notable.

In the case of Impala, development use cases have more of a business intelligence slant, and the list of 26 vendors joining the CAP includes Actuate, IBM Cognos, Information Builders, MicroStrategy, Pentaho, Tableau Software, and Tibco.

"The two big use cases we're seeing for Impala are aggregating data in Hadoop to present analytic dashboards and improving data-discovery applications by providing faster performance than Hive," Alex Gutow, Cloudera's product marketing manager, said in a phone interview with InformationWeek.

Cloudera was among the first vendors to provide an alternative to Apache Hive, Hadoop's native SQL query option, when it introduced Impala 18 months ago. Today there are at least a dozen SQL-on-Hadoop alternatives to Hive, including relational databases ported to run on top of Hadoop -- like options from Actian (SQL Edition/ParAccel), HP (Vertica), and Pivotal (HAWQ/Greenplum) -- and Hadoop-native projects -- like Apache Drill, Hortonworks' Hive-on-Tez Stinger initiative, Facebook's Presto query engine, and Spark SQL.

"Unlike Impala, the databases running on Hadoop don't have the flexibility of the schema-on-read approach. They lack support for Hadoop file formats and YARN, and they don't integrate with Hadoop security and data-governance technologies," Gutow said.

That's a Hadoop-centric view of the world, whereas database vendors, like HP and Pivotal, would likely respond that they have their own, far more familiar SQL-based options for security and data governance.

As for those Hadoop-native rivals, "they aren't as performant as Impala, and they don't offer its compatibility with [popular] BI solutions," said Gutow.

Certainly, nothing matches Hive on compatibility, as Hadoop integration often starts with this query tool. But Cloudera cites its own recent benchmark tests and a third-party study by IBM as proof that Impala delivers superior, database-like query speeds compared with four rivals: the original Hive (using MapReduce), the Hive-on-Tez version that Hortonworks is developing, Presto, and Spark SQL.

Why promote Spark through the Cloudera Accelerator Program but then show up Spark SQL performance in a benchmark test? Because Impala and Spark "solve different problems," and a one-size-fits-all approach won't work, Gutow said.

"Impala shines if you're looking for broad, multi-user applications and SQL compatibility with other tools," she said. "Spark SQL is more likely to be used within Spark applications where you need to add a bit of SQL into a mix of analysis approaches."

Asked for clear evidence of practitioner uptake of Impala, Cloudera provided a list of a dozen publicly nameable customers, including the insurer Allstate, the data provider Epsilon, and the software developer and systems integrator CSC. Cloudera further claims that "hundreds" of customers have deployed Impala, and that the software has had "more than 1 million" downloads. Cloudera won't say whether the customer count is closer to 100 or 999.

To sum up what's going on with Hadoop these days, the race is on to come up with better, faster alternatives to MapReduce and Hive -- the original but slow and clunky data-processing and query tools native to the platform. The core value of Hadoop as a versatile, low-cost storage platform is proven. But at this point there are far more options than clear winners for data processing, query, and analysis.

About the Author(s)

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
