Cloudera Impala Brings SQL Querying To Hadoop

Cloudera's SQL-on-Hadoop tool hits general release, but will it satisfy demands for faster, easier exploration of big data?
Cloudera on Tuesday announced the general release of its Impala query engine for Hadoop after six months of beta testing by more than 40 customers.

It's the first so-called SQL-on-Hadoop product to reach general release. But with a bevy of such systems on the way -- including options from IBM (Big SQL), Hortonworks (Stinger), MapR (Drill), Pivotal (HAWQ) and Teradata (SQL-H) -- the question is whether Impala will stand out as the best fix for Hadoop's shortcomings.

Companies are embracing Hadoop for its high-scale storage capacity, its relatively low cost (compared to relational databases at scale) and its ability to quickly ingest new and variable data types without the need to transform them all to fit a rigid, predefined schema.

Hadoop's biggest shortcoming is that most analysis is done through slow, batch-oriented and hard-to-code MapReduce processing. The Apache Hive data warehousing infrastructure offers limited SQL querying capabilities, but it too relies on MapReduce behind the scenes, so it's much slower than conventional database querying.


Impala supports direct querying of data in the Hadoop Distributed File System (HDFS) and HBase (NoSQL database) indexes, and Cloudera claims it's three to 30 times faster than Hive. Beta customers report results that fall within that range. Six3 Systems, for example, a systems integrator serving federal agencies, has seen querying at least 14 times faster than Hive, according to analytics developer Wayne Wheeles.

"Just dealing with one day's worth of data on one system -- about 20 million records -- running one of my analytics using Hive took 82 seconds whereas running it on Impala took less than six seconds," Wheeles told InformationWeek.
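The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch based only on the two timings Wheeles quoted; the function and variable names are illustrative, and because the Impala run took "less than six seconds," six seconds is treated as an upper bound on its runtime, which makes the result a lower bound on the speedup.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


# Timings quoted by Wheeles for one day's data (about 20 million records).
hive_seconds = 82.0
# "Less than six seconds" gives only an upper bound on the Impala runtime.
impala_upper_bound_seconds = 6.0

# Dividing by the upper bound yields a lower bound on the speedup.
min_speedup = speedup(hive_seconds, impala_upper_bound_seconds)
print(f"Impala ran more than {min_speedup:.1f} times faster than Hive")
```

The floor works out to roughly 13.7, in line with the "at least 14 times" figure and inside the three-to-30-times range Cloudera claims.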

The core Impala query engine ships under the Apache license. It is "nearly" SQL standard compliant via Hive SQL, which means it falls short of full ANSI SQL support. But Impala does include ODBC and JDBC drivers and is supported by business intelligence systems from Alteryx, Karmasphere, MicroStrategy, Pentaho, QlikTech and Tableau Software. An optional Cloudera Enterprise Real-Time Query (RTQ) subscription adds an administrative module for deploying, managing and monitoring Impala and its query performance.

Despite the wave of SQL-on-Hadoop project announcements in recent months, Cloudera CEO Mike Olson told InformationWeek that Impala won't be surpassed by rivals such as HAWQ from EMC/VMware spinoff Pivotal or Hortonworks' Stinger project.

"The announcement out of Pivotal is just a port of a decade-old technology [meaning Greenplum database] that has its own, independent, non-integrated schema layer that can't share information with the rest of the [Hadoop] platform," Olson said. As for Stinger, Olson pointed out that the project relies on Hive and MapReduce, "and we don't believe that it's going to be possible to drive down latencies and improve performance sufficiently via that platform."

Impala is clearly faster than Hive, but Cloudera said from the beginning that it's not a replacement for conventional data warehouses when workloads involve demanding service-level agreements or multi-dimensional (cube) analyses. Nevertheless, Olson insists that Impala will enable many organizations to shift a significant share of data and query workloads over to Hadoop, where Cloudera asserts that managing data at high scale costs anywhere from 1% to 10% of what it costs in a conventional data warehouse.

Impala lacks flexible data-exploration capabilities, so companies will have to put forethought into which queries and data they put on Hadoop and which they keep on conventional warehouses. That's one drawback, along with incomplete SQL support, that Cloudera competitor MapR says it hopes to avoid with Apache Drill, a project it has on track for beta release in the third quarter.

Full SQL support is needed "so you don't have SQL-related application errors that get traced back to narrow differences between SQL and SQL-like functionality," Jack Norris, VP of marketing at MapR, told InformationWeek. "We also want to support schema on discovery, so rather than dictating [querying] in advance, we want to be flexible and discover on the fly."

Exploratory analytics is a work in progress with Impala, Olson admitted. "It is a bit of a pain to have to think about [those choices] in advance, but it's a pain that already exists because one of the biggest costs of a conventional data warehouse is up-front schema definition," Olson said. "You'd rather have just one big sea of data and get all the capabilities of every platform everywhere, but we're not there yet."

The key advantage that Cloudera has with Impala, for now, is that it's already available, whereas it might be months before competitors move out of beta. More important, Cloudera has a larger base of support customers and software users than any other Hadoop distribution, and that's where the rubber meets the road in adoption.