4/30/2013
08:04 AM

Cloudera Impala Brings SQL Querying To Hadoop

Cloudera's SQL-on-Hadoop tool hits general release, but will it satisfy demands for faster, easier exploration of big data?

Cloudera on Tuesday announced the general release of its Impala query engine for Hadoop after six months of beta testing by more than 40 customers.

It's the first so-called SQL-on-Hadoop product to reach general release. But with a bevy of such systems on the way -- including options from IBM (Big SQL), Hortonworks (Stinger), MapR (Drill), Pivotal (HAWQ) and Teradata (SQL-H) -- the question is whether Impala will stand out as the best fix for Hadoop's shortcomings.

Companies are embracing Hadoop for its high-scale storage capacity, relatively low cost (compared to relational databases at scale) and its ability to quickly ingest new and variable data types without the need to transform them all to a rigid, predefined schema.

Hadoop's biggest shortcoming is that most analysis is done through slow, batch-oriented and hard-to-code MapReduce processing. Apache Hive data warehousing infrastructure offers limited SQL querying capabilities, but it too relies on MapReduce behind the scenes, so it's much slower than conventional database querying.

Impala supports direct querying of data in the Hadoop Distributed File System (HDFS) and HBase (NoSQL database) indexes, and Cloudera claims it's three to 30 times faster than Hive. Beta customers report results that fall within that range. Six3 Systems, for example, a systems integrator serving federal agencies, has seen querying at least 14 times faster than Hive, according to analytics developer Wayne Wheeles.

"Just dealing with one day's worth of data on one system -- about 20 million records -- running one of my analytics using Hive took 82 seconds whereas running it on Impala took less than six seconds," Wheeles told InformationWeek.
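
As a rough check on the figures Wheeles cites, the speedup can be computed directly. The sketch below assumes an Impala runtime of exactly six seconds as an upper bound, since the actual figure was reported only as "less than six seconds"; the result is therefore a lower bound on the speedup. The function and variable names are illustrative.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new runtime is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


hive_seconds = 82.0   # Hive runtime reported in the quote
impala_upper = 6.0    # assumption: upper bound, Impala took "less than six seconds"

# Using the upper bound for Impala yields a lower bound on the speedup.
lower_bound = speedup(hive_seconds, impala_upper)
print(f"Impala speedup over Hive: more than {lower_bound:.1f}x")
```

Any Impala runtime below about 5.86 seconds (82 / 14) pushes the ratio past the "at least 14 times faster" figure, and either way the result sits inside Cloudera's claimed range of three to 30 times.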

The core Impala query engine ships under the Apache license. It is "nearly" SQL-standard compliant via Hive SQL, which means it falls short of full ANSI SQL support. But Impala does include ODBC and JDBC drivers and is supported by business intelligence systems from Alteryx, Karmasphere, MicroStrategy, Pentaho, QlikTech and Tableau Software. An optional Cloudera Enterprise Real-Time Query (RTQ) subscription adds an administrative module for deploying, managing and monitoring Impala and its query performance.

Despite the wave of SQL-on-Hadoop project announcements in recent months, Cloudera CEO Mike Olson told InformationWeek that Impala won't be surpassed by rivals such as HAWQ from EMC/VMware spinoff Pivotal or Hortonworks' Stinger project.

"The announcement out of Pivotal is just a port of a decade-old technology [meaning Greenplum database] that has its own, independent, non-integrated schema layer that can't share information with the rest of the [Hadoop] platform," Olson said. As for Stinger, Olson pointed out that that project relies on Hive and MapReduce, "and we don't believe that it's going to be possible to drive down latencies and improve performance sufficiently via that platform."

Impala is clearly faster than Hive, but Cloudera said from the beginning that it's not a replacement for conventional data warehouses when workloads involve demanding service-level agreements or multi-dimensional (cube) analyses. Nevertheless, Olson insists that Impala will enable many organizations to shift a significant share of data and query workloads over to Hadoop, where Cloudera asserts that managing data at high scale costs anywhere from 1% to 10% of what it costs in a conventional data warehouse.

Impala lacks flexible data-exploration capabilities, so companies will have to put forethought into which queries and data they put on Hadoop and which they keep on conventional warehouses. That's one drawback, along with incomplete SQL support, that Cloudera competitor MapR says it's hoping to avoid with Apache Drill, the project it has on track for beta release in the third quarter.

Full SQL support is needed "so you don't have SQL-related application errors that get traced back to narrow differences between SQL and SQL-like functionality," Jack Norris, VP of marketing at MapR, told InformationWeek. "We also want to support schema on discovery, so rather than dictating [querying] in advance, we want to be flexible and discover on the fly."

Exploratory analytics is a work in progress with Impala, Olson admitted. "It is a bit of a pain to have to think about [those choices] in advance, but it's a pain that already exists because one of the biggest costs of a conventional data warehouse is up-front schema definition," Olson said. "You'd rather have just one big sea of data and get all the capabilities of every platform everywhere, but we're not there yet."

The key advantage that Cloudera has with Impala, for now, is that it's already available, whereas it might be months before competitors move out of beta. More important, Cloudera has a larger base of support customers and software users than any other Hadoop distribution, and that's where the rubber meets the road in adoption.

Comments
Michael Hausenblas
User Rank: Apprentice
5/1/2013 | 4:21:30 PM
re: Cloudera Impala Brings SQL Querying To Hadoop
Ack. You have not seen this demand ;)
Todd Lipcon
User Rank: Apprentice
5/1/2013 | 3:09:18 PM
re: Cloudera Impala Brings SQL Querying To Hadoop
Hi Michael. Impala also has support for querying HBase. As far as I'm aware we have not seen any demand for querying other non-Hadoop systems like MySQL (which of course has its own query parser and execution anyway).
Michael Hausenblas
User Rank: Apprentice
5/1/2013 | 4:43:42 AM
re: Cloudera Impala Brings SQL Querying To Hadoop
Indeed. That is the beauty of Apache Drill. Not only the flexibility concerning the data formats supported, but also the type of storage. Impala is confined to HDFS(-based stuff), in Apache Drill you can use HBase or MySQL or CouchDB or ...
Michael Hausenblas
User Rank: Apprentice
5/1/2013 | 4:41:03 AM
re: Cloudera Impala Brings SQL Querying To Hadoop
Doug, thank you for your write-up! I wonder why you can't be bothered to invest the two minutes to also provide a link to Apache Drill? It's http://incubator.apache.org/dr... for the record and you're welcome.

Secondly, I find it poor research to state that Apache Drill == MapR. Yes, we're certainly supporting it, however, in case you haven't figured it out yet, it's an *Apache* Incubator project that is driven and defined by the community, MapR being part of it. See also one of my recent status reports at https://speakerdeck.com/mhause... containing a shout out to some of the core contributors (yes, with two people from MapR, alright, I'm not hiding the fact).

Look, I'll give it to Cloudera that they have made Impala now GA, but it's funny to learn that they avoided the 'hard way' (achieving community consensus on the APIs via Apache) and shipped a proprietary product. I believe in the power of open APIs and that's why I think Apache Drill will be successful, eventually. As an additional fun fact: ask Oracle why they don't ship Impala as part of their partnership with Cloudera?

Cheers,
Michael
D. Henschen
User Rank: Author
4/30/2013 | 11:50:36 PM
re: Cloudera Impala Brings SQL Querying To Hadoop
Another weakness of Impala, according to competitors, is flexibility in handling a range of data formats. Countering this claim, Cloudera talked up support for both Parquet (compression) and Avro-supported file formats. Any data wonks have a comment on whether Cloudera has effectively addressed the need for handling a variety of data formats?
cbabcock
User Rank: Strategist
4/30/2013 | 11:04:35 PM
re: Cloudera Impala Brings SQL Querying To Hadoop
Cloudera remains focused and out front on tools that help users adopt Hadoop. If its Impala has a true performance advantage, that will tend to keep it out front. Time and user experience will tell. Charlie Babcock, editor at large, InformationWeek
Deirdre Blake
User Rank: Apprentice
4/30/2013 | 5:07:16 PM
re: Cloudera Impala Brings SQL Querying To Hadoop
Kudos to Cloudera for getting there first, it's a step in the right direction.