BI On Hadoop Success: 7 Things To Know
When enterprises implement Hadoop, their top use-case is business intelligence (BI). Now a new benchmark study shows which Hadoop SQL engines are best for which workloads. Here is a look at 7 key findings from that study.
![](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/blt877837904d8aa9d4/64cb44480665a6696a2acadf/benchmark-Kenishirotie-iStock_71456057_SMALL_Resized.png?width=700&auto=webp&quality=80&disable=upscale)
Business intelligence is the top use-case for IT organizations implementing Hadoop, according to a large industry survey last year. Now a new benchmark study shows which Hadoop ecosystem tools are best for particular types of BI queries.
The recently released study's findings offer perspective for IT organizations on a handful of SQL-on-Hadoop engines, such as Hive, Impala, Presto, and Spark SQL. They provide insights on their performance for particular kinds of analytic jobs. The benchmark comes from AtScale, a company that is focused on helping organizations make business intelligence work on Hadoop.
"Different engines continue to perform well for different jobs," said Josh Klahr, VP of products at AtScale, in an interview with InformationWeek. "IT organizations should probably be wary about making a bet on just one engine -- like putting everything on Hive or on Impala."
The new benchmark, released this week, is the second edition. It provides insight into how the performance of each of these engines has improved since the first report, released 6 months ago.
IT organizations will want to take note of the results. Business intelligence is the top use-case that enterprises plan for their Hadoop implementations, according to a 2015 Hadoop Maturity Survey of close to 2,100 business, IT, and C-suite executives conducted by AtScale, Tableau, and the three big Hadoop distributors -- Cloudera, Hortonworks, and MapR.
That report found that ETL and data science workloads on Hadoop were decreasing, while business intelligence had gained momentum. That survey also showed that 69% of organizations cited business intelligence as the top use-case, followed by data science at 56%, and ETL at 51%. In 2014, the same survey showed ETL at 74%, data science at 62%, and business intelligence at 65%.
What's more, getting value out of your Hadoop projects may be directly linked to whether your IT organization has enabled business users to query the Hadoop data directly themselves, according to the 2015 Hadoop Maturity Survey. Providing this self-service access to business users unlocks value for organizations.
The survey showed that of the companies that provided self-service options to users, 61% say they are gaining value from Hadoop. Some 41% of the companies that did not provide self-service options say that they see tangible value, and 59% say that they don't see tangible value.
How do you make sure you are adding the right engines to your Hadoop infrastructure to best enable the fast response time your business users expect on their queries? Here is a look at some notable findings from AtScale's tools benchmark test.
The benchmark results show that there is no one-size-fits-all general purpose engine for executing these types of queries. "Depending on raw data size, query complexity, and the target number of end-users, enterprises will find that each engine has its own 'sweet spot,'" according to the study's findings.
The benchmark shows that Impala and Spark SQL are the stars when it comes to queries against small data sets. AtScale said that the most recent release of Hive LLAP (Live Long and Process) shows acceptable query response times on small data sets, and that Presto also shows promise for these types of queries.
Concurrency testing looks at performance when the data is hit with many queries at the same time. Presto, which AtScale included for the first time in this benchmark test, showed the best results on this metric. Impala continued its strong concurrent query performance. Hive and Spark SQL registered significant improvements on this metric in the current benchmark test.
AtScale's Klahr warns that, while Impala and Presto do well on concurrency, the results shifted as queries became more complex. When it came to complex queries, Spark SQL started to outperform Impala, Klahr told InformationWeek. "You need to have a multi-engine strategy and a mechanism that can automatically route end-user queries to the right engine without the end-user having to think about 'Am I writing a Spark query or an Impala query?'" he said, noting that AtScale does perform that kind of automatic routing to the best engine.
Querying big data sets generally means slower results. The fastest performing engines for these data sets were Spark SQL at less than 20 seconds, followed by Impala at less than 40 seconds. Response times for both of these engines improved significantly from the benchmark six months ago to today. Hive and Presto returned results in just over 2 minutes. Increasing the number of joins generally increased processing time, according to AtScale. Spark SQL and Impala were more likely to perform best as the number of joins increased.
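To make the multi-engine idea concrete, here is a minimal Python sketch of rule-based query routing that follows the benchmark's broad findings. It is an illustration only, not AtScale's routing logic, which the report does not describe. The `QueryProfile` fields, the `choose_engine` function, and every threshold value are hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    """Hypothetical description of an incoming BI query."""
    rows_scanned: int      # rough estimate of rows the query will touch
    join_count: int        # number of joins in the query
    concurrent_users: int  # users querying at the same time


# Assumed cutoffs for illustration; the benchmark does not publish thresholds.
LARGE_DATA_ROWS = 1_000_000_000
MANY_JOINS = 3
HIGH_CONCURRENCY = 10


def choose_engine(query: QueryProfile) -> str:
    """Pick an engine name using the benchmark's broad findings."""
    # Large data sets and join-heavy queries: Spark SQL was fastest.
    if query.rows_scanned >= LARGE_DATA_ROWS or query.join_count >= MANY_JOINS:
        return "spark_sql"
    # Many simultaneous simple queries: Presto led the concurrency test.
    if query.concurrent_users >= HIGH_CONCURRENCY:
        return "presto"
    # Small data sets with few users: Impala was among the top performers.
    return "impala"


if __name__ == "__main__":
    dashboard_query = QueryProfile(rows_scanned=1_000_000, join_count=1, concurrent_users=50)
    print(choose_engine(dashboard_query))
```

A real deployment would tune such rules against its own workloads, since the benchmark stresses that each engine's sweet spot depends on data size, query complexity, and user count.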
All the engines that were evaluated registered significant performance improvements since AtScale's last benchmark test 6 months ago -- on the order of 2x to 4x, according to the company. "This is great news for those enterprises deploying BI workloads to Hadoop. We believe that a best-of-breed strategy -- best engine, best semantic BI layer, best visualization tool -- will lead enterprises down the most successful path to BI-on-Hadoop success," the company said in its benchmark report.
Klahr told InformationWeek in an interview that between the first edition of the benchmark 6 months ago and today, the query performance of Hive improved by 3.5x, Spark by 2.5x, and Impala by 3x. "If I'm a buyer or an executive, these improvements are going to make me stop and question any investment on a proprietary Hadoop engine," Klahr said, because these open source tools are being improved at a rapid pace.