Parikh: A couple of ways. We use Hive [data warehousing infrastructure] to run lots of reports. Hive is something we developed and open sourced, and it runs on top of the Hadoop stack. We also have a system called Scuba, which is a real-time system for analytics. Scuba stores everything in memory so it's really fast, and you can do all sorts of transformations and drill-downs on the data. We use it both for operations data -- site performance metrics, reliability metrics and so on -- and for business data, studying the effectiveness of the advertising system or ranking systems.
We're working on a couple of other things including a new platform that will allow us to query the data in our Hadoop infrastructure much more rapidly. We're building that out, and we're probably going to talk about it this summer.
IW: Hive's lack of speed is well known. So will this new platform solve that problem?
Parikh: We have a number of efforts on real time. Scuba is one but we're also working on Hive extensively, and we're in the process of pushing our contributions back into the open source version. You'll see, over the course of the coming months, some very significant changes that we're going to push into the community to make Hive faster.
Hive is still a workhorse and it will remain the workhorse for a long time because it's easy and it scales. Easy is the key thing when you want lots of people to be able to engage with a tool. Hive is very simple to use, so we've been focused on performance to make it even more effective.
IW: Is it all about speed, or are you also working on broader SQL-query capabilities?
Parikh: We're working on both. We're filling some of the gaps in what it can do SQL-wise, and we're also working on performance and reliability. There's also this new, unannounced platform that we'll be talking about later this summer that will sit next to both Scuba and Hive. Everything about it is real time, and it will cut down the latency [of Hadoop] significantly.
IW: What about graph analysis? That would seem to be a Facebook specialty since it's about understanding network relationships.
Parikh: Everything in Facebook is represented in some sort of graph [with nodes -- people, organizations, places, brands, etc. -- and edges -- the relationships among those nodes]. We maintain the largest people-object graph in the world, and it's constantly changing, so it's not something you can handle in batch mode. The interactions are constant and you want the results to be fresh. We have to share in a way that lets us scale. All of these capabilities are behind the Graph Search product that we introduced in January.
If you're talking about graph analytics, there's an open-source project out there called Pregel that Google has written about. There's also the Apache Giraph project, which is more about graph analytics and graph processing.
We are also going to be talking about a project later this summer -- probably at the same time we talk about our real-time initiative -- that is a version of graph analytics that sits on top of our Hadoop infrastructure. There are some cool problems we've been able to solve by being able to process [Facebook's] large graph, infer data and make better suggestions to people, whether it be content or ads.
IW: How and where do Facebook's current graph-analysis capabilities operate?
Parikh: A lot of the graph analytics are written and run on the Hive infrastructure. Hive's performance and scale issues make the overall latency of these analytics higher than we would like, and that's one of the reasons we've been investing in those other projects discussed earlier to speed things up and do things more efficiently.
There's another graph processing engine that we've written that sits between our Web tier and our storage tier. That has been around for a long time, and it's the real-time engine that allows our website to generate the types of experiences that it does today. You can ask, "show me all my friends who like X," and it gives you a sorted and filtered list. It generates each and every page on the site. It's an area that's pretty ripe for innovation.
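[A rough illustration of the "show me all my friends who like X" query follows. It is a minimal Python sketch, not Facebook's engine: the `TinyGraph` class, the `friend` and `likes` edge types, and the sample names are all hypothetical, and a production system would shard the graph across many machines rather than hold it in one in-memory dictionary.]

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory graph: typed edges between named nodes."""

    def __init__(self):
        # Maps (source node, edge type) -> set of target nodes.
        self.edges = defaultdict(set)

    def add_edge(self, source, edge_type, target, symmetric=False):
        self.edges[(source, edge_type)].add(target)
        if symmetric:
            # Friendship goes both ways; a "likes" edge does not.
            self.edges[(target, edge_type)].add(source)

    def neighbors(self, source, edge_type):
        # .get avoids creating empty entries for unknown nodes.
        return self.edges.get((source, edge_type), set())


def friends_who_like(graph, person, thing):
    """Filter a person's friends to those who like `thing`, sorted by name."""
    friends = graph.neighbors(person, "friend")
    return sorted(f for f in friends if thing in graph.neighbors(f, "likes"))


# Assumed sample data, purely for illustration.
graph = TinyGraph()
for friend in ("bob", "carol", "dave"):
    graph.add_edge("alice", "friend", friend, symmetric=True)
graph.add_edge("bob", "likes", "hiking")
graph.add_edge("carol", "likes", "chess")
graph.add_edge("dave", "likes", "hiking")

result = friends_who_like(graph, "alice", "hiking")
print(result)  # ['bob', 'dave']
```

[The query is two edge lookups: follow the `friend` edges out of one node, then keep only the neighbors that have a `likes` edge to the target. The sort here is alphabetical for simplicity; the interview does not say how the real engine orders its results.]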