Facebook Open Sources Presto SQL Query Engine - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

08:18 AM
Connect Directly

Facebook Open Sources Presto SQL Query Engine

Facebook commits its Presto SQL-on-Hadoop technology to open source, offering another path to mainstream big data analysis.

Facebook announced Wednesday that it is committing its Presto low-latency, SQL-compliant query system for Hadoop to open source. The move brings yet another fast query option to Hadoop, making it all the more likely the increasingly popular platform will be accessible to SQL-based business intelligence tools and SQL-savvy BI and data-management professionals.

Facebook has one of the largest data warehouses in the world, now surpassing 300 petabytes. That sheer scale has forced Facebook to invent its own tools for working with variably-structured data at high-scale. Facebook was an early adopter of Hadoop, and it developed the Hive interface to support SQL-like querying. Facebook later contributed Hive to open source and it's now a popular tool used by most companies that use Hadoop.

In a blog post announcing its open source contribution, Martin Traverso, a member of Facebook's Presto team, wrote that as Facebook's data volumes grew to petabyte scale, it became clear that it needed something faster than Hive, which relies on relies on relatively slow MapReduce processing behind-the-scenes.

"We evaluated a few external projects, but they were either too nascent or did not meet our requirements for flexibility and scale," wrote Traverso. That led to the Presto project, which was kicked off in the fall of 2012.

[ Want more on the latest vision for big data? Read Cloudera Plans Data Hub Role For Hadoop. ]

Facebook's first Presto production system was up and running in early 2013, and it was rolled out to all of Facebook's data warehouse users this spring, deployed in multiple geographic regions and scaled up on 1,000-node Hadoop clusters. Presto was announced to the world in June, and Traverso wrote in his blog post that the system is now actively used by more than 1,000 employees running more than 30,000 queries against a petabyte of data each day.

In terms of performance, Facebook reports that Presto is "10X better than Hive/MapReduce in terms of CPU efficiency and latency for most queries." It also supports a "large subset" of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts and approximate percentiles.

What's not clear from Facebook's blog post is which external projects it looked at when it decided to build Presto back in 2012. (Facebook wasn't available at press time to verify which version of Hive it's benchmarked against or what degree of ANSI SQL-compliant functionality it supports.) Multiple SQL-on-Hadoop options have been introduced in 2013, including Cloudera's Impala, Pivotal's HAWQ and Hortonwork's improved Hive with enhancements from its Stinger project. Still in development are IBM BigSQL and MapR-driven Apache Drill.

Cloudera and Hortonworks, the two leading Hadoop distributors, both welcomed Facebook's Presto announcement, citing it as an example of the strength of the open-source model. Both also said they would support the technology if it's widely embraced by the Hadoop community. Nonetheless, executives at both companies were guarded about the prospects for Presto adoption, and they interpreted Facebook's achievement in very different ways.

Cloudera hailed Facebook's approach as a validation of its own move away from Hive with Impala -- the implication being that it's a rejection of Hortonwork's faith in Hive as the right architecture for SQL querying.

"Both Presto and Impala are a complete departure from the Hive query engine and MapReduce," Charles Zedlewski, Cloudera's VP of products told InformationWeek. " Facebook is the single largest Hive user in the world, and both we and Facebook concluded that we were never going to get to the level of speed, interactivity, concurrency, and SQL support needed with the old Hive optimizer."

Hortonworks' Shuan Connolly told InformationWeek that Facebook's claimed 10X performance gain over Hive is tied to an old version of Hive. "I believe they benchmarked against Hive 0.10, which was the version before all of the work on Stinger," said Connolly. "Apache Hive is now at 0.12, so there's a new release that has a lot of performance improvements."

The new Hive is 60 to 70 times faster than the old Hive, according to Connolly, and he added that additional performance gains lie ahead through the use of the Hortonworks-developed Tez in-memory technology. Nonetheless, Connolly declined to speculate on the comparative performance of the new Hive to Presto. Like Zedlewski, Connolly was careful to say it would be up to the Hadoop community to decide whether Presto offers performance advantages over either Impala or Hive.

Support for ANSI-standard SQL is another characteristic that will determine the adoption of any SQL-on-Hadoop option. Zedlewski said Cloudera Impala supports virtually all ANSI SQL 1992 functionality and some aspects of SQL 1999 and SQL 2003. Connolly said the latest version of Hive supports the "vast majority" of SQL 1992.

For now the bottom line is that the two leading Hadoop distributors, Cloudera and Hortonworks, welcome Presto and laud Facebook for its efforts. But they're not about to embrace and support Presto unless the community demands adoption and support.

Links to Presto documentation and free downloads of the software are available through Traverso's blog post.

IT leaders must know the trade-offs they face to get NoSQL's scalability, flexibility and cost savings. Also in the When NoSQL Makes Sense issue of InformationWeek: Oregon's experience building an Obamacare exchange. (Free registration required.)

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Comment  | 
Print  | 
More Insights
Newest First  |  Oldest First  |  Threaded View
User Rank: Apprentice
4/10/2014 | 4:08:31 PM
Benchmark comparing Hive, Presto, Impala, and InfiniDB
Hi Doug,

Just wanted to let you know InfiniDB sponsored an independent benchmark executed by Radiant Advisors that compares Hive, Impala, Presto, and InfiniDB.  The report is available for download (with registration) at http://www.infinidb.co/blog/open-source-sql-hadoop-query-engines-benchmark.  There is a webinar on April 23rd to review the full details, registration available via the same link.

The Presto technology was runner-up in syntax support and likely has advantages at extreme scale based on it's pedigree. 


A quick summary

InfiniDB supported syntax for all 10 queries, with queries running between 1.28 and 17.62 seconds.

Presto supported syntax for 9 of 10 queries, with queries running between 18.89 and 506.84 seconds.

Impala supported syntax for 7 of 10 queries, with queries running between 3.1 and 69.38 seconds.

Hive 0.11 supported syntax for 7/10 queries, with queries running between 102.59 and 277.18 seconds. 

Hive 0.12 supported syntax for 7/10 queries, with queries running between 91.39 and 325.68 seconds.

     Note that 3 of the 7 queries supported with Hive did not complete due to resource issues.

A chart of the results is available at :  http://bit.ly/1n93muT


Jim Tommaney, CTO InfiniDB
2021 Outlook: Tackling Cloud Transformation Choices
Joao-Pierre S. Ruth, Senior Writer,  1/4/2021
Enterprise IT Leaders Face Two Paths to AI
Jessica Davis, Senior Editor, Enterprise Apps,  12/23/2020
10 IT Trends to Watch for in 2021
Cynthia Harvey, Freelance Journalist, InformationWeek,  12/22/2020
White Papers
Register for InformationWeek Newsletters
The State of Cloud Computing - Fall 2020
The State of Cloud Computing - Fall 2020
Download this report to compare how cloud usage and spending patterns have changed in 2020, and how respondents think they'll evolve over the next two years.
Current Issue
2021 Top Enterprise IT Trends
We've identified the key trends that are poised to impact the IT landscape in 2021. Find out why they're important and how they will affect you.
Flash Poll