Facebook has one of the largest data warehouses in the world, now surpassing 300 petabytes. That sheer scale has forced Facebook to invent its own tools for working with variably-structured data at high-scale. Facebook was an early adopter of Hadoop, and it developed the Hive interface to support SQL-like querying. Facebook later contributed Hive to open source and it's now a popular tool used by most companies that use Hadoop.
In a blog post announcing its open source contribution, Martin Traverso, a member of Facebook's Presto team, wrote that as Facebook's data volumes grew to petabyte scale, it became clear that it needed something faster than Hive, which relies on relies on relatively slow MapReduce processing behind-the-scenes.
"We evaluated a few external projects, but they were either too nascent or did not meet our requirements for flexibility and scale," wrote Traverso. That led to the Presto project, which was kicked off in the fall of 2012.
[ Want more on the latest vision for big data? Read Cloudera Plans Data Hub Role For Hadoop. ]
Facebook's first Presto production system was up and running in early 2013, and it was rolled out to all of Facebook's data warehouse users this spring, deployed in multiple geographic regions and scaled up on 1,000-node Hadoop clusters. Presto was announced to the world in June, and Traverso wrote in his blog post that the system is now actively used by more than 1,000 employees running more than 30,000 queries against a petabyte of data each day.
In terms of performance, Facebook reports that Presto is "10X better than Hive/MapReduce in terms of CPU efficiency and latency for most queries." It also supports a "large subset" of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts and approximate percentiles.
What's not clear from Facebook's blog post is which external projects it looked at when it decided to build Presto back in 2012. (Facebook wasn't available at press time to verify which version of Hive it's benchmarked against or what degree of ANSI SQL-compliant functionality it supports.) Multiple SQL-on-Hadoop options have been introduced in 2013, including Cloudera's Impala, Pivotal's HAWQ and Hortonwork's improved Hive with enhancements from its Stinger project. Still in development are IBM BigSQL and MapR-driven Apache Drill.
Cloudera and Hortonworks, the two leading Hadoop distributors, both welcomed Facebook's Presto announcement, citing it as an example of the strength of the open-source model. Both also said they would support the technology if it's widely embraced by the Hadoop community. Nonetheless, executives at both companies were guarded about the prospects for Presto adoption, and they interpreted Facebook's achievement in very different ways.
Cloudera hailed Facebook's approach as a validation of its own move away from Hive with Impala -- the implication being that it's a rejection of Hortonwork's faith in Hive as the right architecture for SQL querying.
"Both Presto and Impala are a complete departure from the Hive query engine and MapReduce," Charles Zedlewski, Cloudera's VP of products told InformationWeek. " Facebook is the single largest Hive user in the world, and both we and Facebook concluded that we were never going to get to the level of speed, interactivity, concurrency, and SQL support needed with the old Hive optimizer."
Hortonworks' Shuan Connolly told InformationWeek that Facebook's claimed 10X performance gain over Hive is tied to an old version of Hive. "I believe they benchmarked against Hive 0.10, which was the version before all of the work on Stinger," said Connolly. "Apache Hive is now at 0.12, so there's a new release that has a lot of performance improvements."
The new Hive is 60 to 70 times faster than the old Hive, according to Connolly, and he added that additional performance gains lie ahead through the use of the Hortonworks-developed Tez in-memory technology. Nonetheless, Connolly declined to speculate on the comparative performance of the new Hive to Presto. Like Zedlewski, Connolly was careful to say it would be up to the Hadoop community to decide whether Presto offers performance advantages over either Impala or Hive.
Support for ANSI-standard SQL is another characteristic that will determine the adoption of any SQL-on-Hadoop option. Zedlewski said Cloudera Impala supports virtually all ANSI SQL 1992 functionality and some aspects of SQL 1999 and SQL 2003. Connolly said the latest version of Hive supports the "vast majority" of SQL 1992.
For now the bottom line is that the two leading Hadoop distributors, Cloudera and Hortonworks, welcome Presto and laud Facebook for its efforts. But they're not about to embrace and support Presto unless the community demands adoption and support.
Links to Presto documentation and free downloads of the software are available through Traverso's blog post.
IT leaders must know the trade-offs they face to get NoSQL's scalability, flexibility and cost savings. Also in the When NoSQL Makes Sense issue of InformationWeek: Oregon's experience building an Obamacare exchange. (Free registration required.)