Airbnb Boosts Presto SQL Query Engine For Hadoop - InformationWeek
IoT
IoT
Data Management // Big Data Analytics
News
3/5/2015
11:35 AM
Connect Directly
Google+
LinkedIn
Twitter
RSS
E-Mail
100%
0%

Airbnb Boosts Presto SQL Query Engine For Hadoop

Airbnb is open sourcing a home-grown data-exploration and SQL query tool for Hadoop. Will it give Facebook Presto a leg up on Cloudera Impala?

7 Linux Facts That Will Surprise You
7 Linux Facts That Will Surprise You
(Click image for larger view and slideshow.)

Airbnb, the data-driven travel giant, announced Thursday that it's donating an internally developed tool called Airpal to open source, a move that could give Facebook-developed Presto an edge in SQL-on-Hadoop querying.

Airpal is a Web-based data-exploration and SQL query interface that runs on Presto, the in-memory SQL-on-Hadoop query technology that Facebook donated to Apache open source in late 2013. Airbnb invented Airpal because it needed a tool that would be more accessible to data analysts and even business users, not just the 23-person Airbnb data science team that handles Hive and Presto queries.

"Airpal has reduced the barriers to entry for running queries," said James Mayfield, an Airbnb product manager, in a phone interview with InformationWeek. "We have marketing teams and search teams that want to engage with our data sets, but they had to go through our small data-science team to get answers. After 10 months in production, we now have more than 500 users who have written queries using Airpal."

[ Want more on this topic? Read Big Data Analytics: Time For New Tools. ]

Airpal, like the Presto query engine, uses SQL. Business users aren't typically SQL savvy, but Mayfield says Airpal provides a Web interface to data and histories of prior queries that makes it possible for untrained users to pick up on the query language.

"People can get a look at the data and build off queries that others have run," he explains. "We've been blown away by the number of people who started jumping into data sets and writing queries, even if they had never written SQL queries before."

Airbnb has donated its Airpal interface for data exploration (top) and SQL querying (below) to Apache open source.

Airbnb has donated its Airpal interface for data exploration (top) and SQL querying (below) to Apache open source.

Until 18 months ago, Airbnb, which runs entirely in Amazon's cloud, was using Amazon Redshift, the relational database service, for fast analysis of data from the company's 1.5-petabyte Hive data warehouse. A component of the Hadoop framework, Hive remains Airbnb's "single source of truth" for analysis of historical transactional information, but Airbnb switched from Redshift to Presto for fast, ad hoc querying to avoid time-consuming extract, transform, and loading work.

"Putting data into Redshift was a pain, with an ETL process that essentially doubled the amount of work we had to do every night," Airbnb software engineer Andy Kramolisch told InformationWeek. "We also work in many different countries around the world, and sometimes different languages and character sets would break the database loading process."

Unconstrained by SQL, Hadoop ingests just about any type of data. Airbnb runs Cloudera's distribution of Hadoop on Amazon's cloud infrastructure, but it decided against using Cloudera Impala, that vendor's SQL-on-Hadoop option.

"We looked at Impala, but it would have been a lot more difficult to set up," Kramolisch said. "Impala is on a C++ code base, and most of our developers are more familiar with Java."

Airbnb also stores all of its data in the RC (row/column) file format, whereas Impala relies on the Parquet file format, according to Mayfield.

"A big part of what try to do here is keep things simple," said Mayfield. "The more transformations that we have to make to data and the more separate systems that we have to maintain, the more incremental developer and maintenance costs we have to incur, so we try to stick with the simplest, cleanest stack possible."

What Airpal adds that Presto lacks, according to Airbnb, is a battery of features that enable non-data-scientists to:

  • Search and find tables
  • See metadata, partitions, schemas, and sample rows
  • Write queries in an easy-to-read editor
  • Submit queries through a Web interface
  • Track query progress
  • Get the results back through the browser as a CSV
  • Create new Hive tables based on the results of a query
  • Save queries and search histories of all queries run within the tool.

As a complement to Presto, Airpal can be used in conjunction with Hive, HDFS, Kafka, Cassandra, MySQL, Postgresql, and JDBC sources, according to Airbnb. The tool also includes access-control capabilities tied to LDAP, so you can limit users from seeing, and therefore querying or seeing query histories tied to, specific tables.

Now that Airpal is in open source, it's up to a wider community to add features and grow adoption. Leading big data practitioners using Presto include Facebook, Qubole, Treasure Data, Netflix, DropBox, and others.

"Presto has had some amazing traction, helping data scientists run more queries and get faster results," said James Pearce, head of Facebook open source initiatives, in a statement from Airbnb. "Airpal will make a good addition to those interested in an open source front end to the query engine."

Attend Interop Las Vegas, the leading independent technology conference and expo series designed to inspire, inform, and connect the world's IT community. In 2015, look for all new programs, networking opportunities, and classes that will help you set your organization’s IT action plan. It happens April 27 to May 1. Register with Discount Code MPOIWK for $200 off Total Access & Conference Passes.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ... View Full Bio

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Comment  | 
Print  | 
More Insights
Comments
Newest First  |  Oldest First  |  Threaded View
D. Henschen
50%
50%
D. Henschen,
User Rank: Author
3/5/2015 | 12:02:22 PM
But what about support?
A friend at Cloudera asks the important question, "what about support?" He's right. This is open source stuff that big data practitioners like Facebook and Airbnb tend to support on their own. We'll see if Hortonworks, MapR, or, shudder, Cloudera, take an interest in supporting Presto and Airpal. For now it's something that self-supporting big data practitioners can experiment with.

The real question with these SQL-on-Hadoop options is just how much SQL to they support? Any opinions on the extent/up-to-dateness of Presto SQL support?
Commentary
The Coming Wave of Regulation over Facial Recognition
James Kobielus, Lead Analyst, SiliconANGLE Wikibon,  10/18/2018
News
10 Top Strategic Predictions for 2019
Jessica Davis, Senior Editor, Enterprise Apps,  10/17/2018
Commentary
AI & Machine Learning: An Enterprise Guide
James M. Connolly, Executive Managing Editor, InformationWeekEditor in Chief,  9/27/2018
Register for InformationWeek Newsletters
Video
Current Issue
The Next Generation of IT Support
The workforce is changing as businesses become global and technology erodes geographical and physical barriers.IT organizations are critical to enabling this transition and can utilize next-generation tools and strategies to provide world-class support regardless of location, platform or device
White Papers
Slideshows
Twitter Feed
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.
Sponsored Video
Flash Poll