Cloudera Boosts Hadoop App Development On Impala - InformationWeek
11/10/2014
12:35 PM

Cloudera Boosts Hadoop App Development On Impala

Cloudera Impala database for Hadoop joins Apache Spark in an accelerator program aimed at breeding big data applications.

The Hadoop vendor Cloudera announced Monday that it is adding its Impala SQL-on-Hadoop database to a Cloudera Accelerator Program (CAP) designed to promote third-party development of big data applications.

CAP was introduced last month with a support initiative around Apache Spark analytic applications. Spark is an open-source in-memory platform that supports machine learning as well as streaming analysis, graph analysis, R-based analytics, and even SQL analysis. More than 20 third-party software vendors joined the program, with Alpine Data Labs, Platfora, RapidMiner, SAS, and Talend being among the most notable vendors.

In the case of Impala, development use cases have more of a business intelligence slant, and the list of 26 vendors joining the CAP includes Actuate, IBM Cognos, Information Builders, MicroStrategy, Pentaho, Tableau Software, and Tibco.

"The two big use cases we're seeing for Impala are aggregating data in Hadoop to present analytic dashboards and improving data-discovery applications by providing faster performance than Hive," Alex Gutow, Cloudera's product marketing manager, said in a phone interview with InformationWeek.

Cloudera was among the first vendors to provide an alternative to Apache Hive, Hadoop's native SQL query option, when it introduced Impala 18 months ago. Today there are at least a dozen SQL-on-Hadoop alternatives to Hive, including relational databases ported to run on top of Hadoop -- like options from Actian (SQL Edition/ParAccel), HP (Vertica), and Pivotal (HAWQ/Greenplum) -- and Hadoop-native projects -- like Apache Drill, Hortonworks' Hive-on-Tez Stinger initiative, Facebook's Presto query engine, and Spark SQL.

"Unlike Impala, the databases running on Hadoop don't have the flexibility of the schema-on-read approach. They lack support for Hadoop file formats and YARN, and they don't integrate with Hadoop security and data-governance technologies," Gutow said.

That's a Hadoop-centric view of the world, whereas database vendors, like HP and Pivotal, would likely respond that they have their own, far more familiar SQL-based options for security and data governance.

As for those Hadoop-native rivals, "they aren't as performant as Impala, and they don't offer its compatibility with [popular] BI solutions," said Gutow.

Certainly, nothing matches Hive on compatibility, as Hadoop integration often starts with this query tool. But Cloudera cites its own recent benchmark tests and a third-party study by IBM as proof that Impala delivers superior, database-like query speeds compared with the original Hive (using MapReduce), the Hive that Hortonworks is developing on Tez, Presto, and Spark SQL.

Results from a Cloudera benchmark test comparing Impala single-user and multi-user query performance with other SQL-on-Hadoop options.

Why promote Spark through the Cloudera Accelerator Program but then show up Spark SQL performance in a benchmark test? Because Impala and Spark "solve different problems," and a one-size-fits-all approach won't work, Gutow said.

"Impala shines if you're looking for broad, multi-user applications and SQL compatibility with other tools," she said. "Spark SQL is more likely to be used within Spark applications where you need to add a bit of SQL into a mix of analysis approaches."

Asked for clear evidence of practitioner uptake of Impala, Cloudera provided a list of a dozen publicly namable customers, including the insurer Allstate, the data provider Epsilon, and the software developer and systems integrator CSC. Cloudera further claims that "hundreds" of customers have deployed Impala, and that the software has had "more than 1 million" downloads. Cloudera won't say whether the customer count is closer to 100 or 999.

To sum up what's going on with Hadoop these days, the race is on to come up with better, faster alternatives to MapReduce and Hive -- the original but slow and clunky data-processing and query tools native to the platform. The core value of Hadoop as a versatile, low-cost storage platform is proven. But there are many more options than dominant winners for data-processing, query, and analysis at this point.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Comments
Lorna Garey,
User Rank: Author
11/11/2014 | 10:14:55 AM
Re: Proprietary format?
What I meant was, becoming dependent on proprietary capabilities. But I suppose the simple fact that the data is not in an RDBMS will force some pricing restraint.
D. Henschen,
User Rank: Author
11/11/2014 | 9:57:19 AM
Proprietary format?
Lorna, I'm not sure what you're talking about when you say "locked into a proprietary format." There are file formats, like JSON and Parquet, that are native to big data platforms, but they are not proprietary. This is a contrast to relational database management systems, all of which use SQL, but with vendor-specific variants of SQL that make it a pain in the ass to switch from one DBMS to another. According to NoSQL database customer Bryson Koehler, CIO at The Weather Company, one of the beauties of NoSQL databases is that "they don't lock you in," as long as you don't build against proprietary capabilities.
Lorna Garey,
User Rank: Author
11/11/2014 | 9:43:48 AM
Re: Core Hadoop is commoditized while all the value is in analytics
"Here's where legacy databases on Hadoop and options from data-management incumbents, like Oracle's Big Data Discovery and Oracle Big Data SQL, just might win the day."

Embrace and extend -- a time-honored tradition that's extending to big data analysis. Doug, is the likelihood of having huge amounts of data locked in to a proprietary format giving CIOs pause? 
Charlie Babcock,
User Rank: Author
11/10/2014 | 5:08:16 PM
Hadoop, an open source starting point, not the end
Interesting analysis here. Hadoop is a rare case where open source MapReduce and the Hadoop File System have pioneered the space, with many follow-up modifications and systems improving them afterward. The follow-up systems have been a result of corporate research or company-initiated projects, such as Cloudera's Impala. In many areas, the Apache project became the gold standard for software in that space (Apache Web Server, Apache Tomcat). In Hadoop's case, it became the springboard that launched a thousand improvements and alternatives.
D. Henschen,
User Rank: Author
11/10/2014 | 2:30:14 PM
Core Hadoop is commoditized while all the value is in analytics
Storing big data on Hadoop? No big deal. That's a commodity play and you won't have to pay much for the base-line infrastructure. Data-analysis on top of Hadoop? That's the valuable (read profitable) part that every vendor wants to provide. Cloudera has Impala for SQL and it recently bought DataPad to brew up analytics capabilities based on Python. Look out, Spark/Databricks, that launch will be another slap in the face. Speaking of slaps in the face, I've talked to other Hadoop distributors who publicly support Spark and then privately bad-mouth it and tout alternatives, such as Apache Storm (for streaming).

Practitioners should forget the politics and focus on using the tools that meet their breadth of analysis needs while also being stable and performant. Here's where legacy databases on Hadoop and options from data-management incumbents, like Oracle's Big Data Discovery and Oracle Big Data SQL, just might win the day -- at least where mainstream enterprises are concerned. Self-respecting Internet giants will always build their own and stick with open-source technologies.