Airbnb is open sourcing a home-grown data-exploration and SQL query tool for Hadoop. Will it give Facebook Presto a leg up on Cloudera Impala?
7 Linux Facts That Will Surprise You
(Click image for larger view and slideshow.)
Airbnb, the data-driven travel giant, announced Thursday that it's donating an internally developed tool called Airpal to open source, a move that could give Facebook-developed Presto an edge in SQL-on-Hadoop querying.
Airpal is a Web-based data-exploration and SQL query interface that runs on Presto, the in-memory SQL-on-Hadoop query technology that Facebook donated to Apache open source in late 2013. Airbnb invented Airpal because it needed a tool that would be more accessible to data analysts and even business users, not just the 23-person Airbnb data science team that handles Hive and Presto queries.
"Airpal has reduced the barriers to entry for running queries," said James Mayfield, an Airbnb product manager, in a phone interview with InformationWeek. "We have marketing teams and search teams that want to engage with our data sets, but they had to go through our small data-science team to get answers. After 10 months in production, we now have more than 500 users who have written queries using Airpal."
Airpal, like the Presto query engine, uses SQL. Business users aren't typically SQL savvy, but Mayfield says Airpal provides a Web interface to data and histories of prior queries that makes it possible for untrained users to pick up on the query language.
"People can get a look at the data and build off queries that others have run," he explains. "We've been blown away by the number of people who started jumping into data sets and writing queries, even if they had never written SQL queries before."
Airbnb has donated its Airpal interface for data exploration (top) and SQL querying (below) to Apache open source.
Until 18 months ago, Airbnb, which runs entirely in Amazon's cloud, was using Amazon Redshift, the relational database service, for fast analysis of data from the company's 1.5-petabyte Hive data warehouse. A component of the Hadoop framework, Hive remains Airbnb's "single source of truth" for analysis of historical transactional information, but Airbnb switched from Redshift to Presto for fast, ad hoc querying to avoid time-consuming extract, transform, and loading work.
"Putting data into Redshift was a pain, with an ETL process that essentially doubled the amount of work we had to do every night," Airbnb software engineer Andy Kramolisch told InformationWeek. "We also work in many different countries around the world, and sometimes different languages and character sets would break the database loading process."
Unconstrained by SQL, Hadoop ingests just about any type of data. Airbnb runs Cloudera's distribution of Hadoop on Amazon's cloud infrastructure, but it decided against using Cloudera Impala, that vendor's SQL-on-Hadoop option.
"We looked at Impala, but it would have been a lot more difficult to set up," Kramolisch said. "Impala is on a C++ code base, and most of our developers are more familiar with Java."
Airbnb also stores all of its data in the RC (row/column) file format, whereas Impala relies on the Parquet file format, according to Mayfield.
"A big part of what try to do here is keep things simple," said Mayfield. "The more transformations that we have to make to data and the more separate systems that we have to maintain, the more incremental developer and maintenance costs we have to incur, so we try to stick with the simplest, cleanest stack possible."
What Airpal adds that Presto lacks, according to Airbnb, is a battery of features that enable non-data-scientists to:
Search and find tables
See metadata, partitions, schemas, and sample rows
Write queries in an easy-to-read editor
Submit queries through a Web interface
Track query progress
Get the results back through the browser as a CSV
Create new Hive tables based on the results of a query
Save queries and search histories of all queries run within the tool.
As a complement to Presto, Airpal can be used in conjunction with Hive, HDFS, Kafka, Cassandra, MySQL, Postgresql, and JDBC sources, according to Airbnb. The tool also includes access-control capabilities tied to LDAP, so you can limit users from seeing, and therefore querying or seeing query histories tied to, specific tables.
Now that Airpal is in open source, it's up to a wider community to add features and grow adoption. Leading big data practitioners using Presto include Facebook, Qubole, Treasure Data, Netflix, DropBox, and others.
"Presto has had some amazing traction, helping data scientists run more queries and get faster results," said James Pearce, head of Facebook open source initiatives, in a statement from Airbnb. "Airpal will make a good addition to those interested in an open source front end to the query engine."
Attend Interop Las Vegas, the leading independent technology conference and expo series designed to inspire, inform, and connect the world's IT community. In 2015, look for all new programs, networking opportunities, and classes that will help you set your organization’s IT action plan. It happens April 27 to May 1. Register with Discount Code MPOIWK for $200 off Total Access & Conference Passes.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ... View Full Bio
We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
The Next Generation of IT SupportThe workforce is changing as businesses become global and technology erodes geographical and physical barriers.IT organizations are critical to enabling this transition and can utilize next-generation tools and strategies to provide world-class support regardless of location, platform or device