Facebook On Big Data Analytics: An Insider's View - InformationWeek

Facebook's Jay Parikh talks about fixing Hive, real-time platforms and how traditional companies can 'thread the needle' of big data success.

IW: How are you addressing real-time analytics?

Parikh: A couple of ways. We use Hive [data warehousing infrastructure] to run lots of reports. Hive is something we developed and open sourced, and it runs on top of the Hadoop stack. We also have a system called Scuba, which is a real-time system for analytics. Scuba stores everything in memory so it's really fast, and you can do all sorts of transformations and drill-downs on the data. We use it both for operations data -- site performance metrics, reliability metrics and so on -- and for business data, studying the effectiveness of the advertising system or ranking systems.

We're working on a couple of other things including a new platform that will allow us to query the data in our Hadoop infrastructure much more rapidly. We're building that out, and we're probably going to talk about it this summer.
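As a rough illustration of the kind of in-memory drill-down Parikh attributes to Scuba, the sketch below keeps a handful of event rows in a Python list and averages a metric per dimension, with an optional filter. It is a toy under stated assumptions, not Scuba's design or API: the row fields (`region`, `page`, `latency_ms`) and the `drill_down` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical in-memory event rows; the field names are illustrative only.
EVENTS = [
    {"region": "us", "page": "home", "latency_ms": 120},
    {"region": "us", "page": "profile", "latency_ms": 200},
    {"region": "eu", "page": "home", "latency_ms": 180},
    {"region": "eu", "page": "home", "latency_ms": 220},
]


def drill_down(rows, group_by, metric, where=None):
    """Average `metric` per value of `group_by`, keeping only rows that match `where`."""
    where = where or {}
    buckets = defaultdict(list)
    for row in rows:
        # A row qualifies only if it matches every filter in `where`.
        if all(row.get(key) == value for key, value in where.items()):
            buckets[row[group_by]].append(row[metric])
    return {key: mean(values) for key, values in buckets.items()}


# Top-level view: average latency per region.
by_region = drill_down(EVENTS, "region", "latency_ms")

# Drill down into one region and break it out by page.
eu_by_page = drill_down(EVENTS, "page", "latency_ms", where={"region": "eu"})
```

Because every row is already in memory, each new grouping or filter is just another pass over the list, which is the property that makes this style of interactive slicing fast at small scale.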

IW: Hive's lack of speed is well known. So will this new platform solve that problem?

Parikh: We have a number of efforts on real time. Scuba is one but we're also working on Hive extensively, and we're in the process of pushing our contributions back into the open source version. You'll see, over the course of the coming months, some very significant changes that we're going to push into the community to make Hive faster.

Hive is still a workhorse and it will remain the workhorse for a long time because it's easy and it scales. Easy is the key thing when you want lots of people to be able to engage with a tool. Hive is very simple to use, so we've been focused on performance to make it even more effective.

IW: Is it all about speed, or are you also working on broader SQL-query capabilities?

Parikh: We're working on both. We're filling some of the gaps in what it can do SQL-wise, and we're also working on performance and reliability. There's also this new, unannounced platform that we'll be talking about later this summer that will sit next to both Scuba and Hive. Everything about it is real time, and it will cut down the latency [of Hadoop] significantly.

IW: What about graph analysis? That would seem to be a Facebook specialty since it's about understanding network relationships.

Parikh: Everything in Facebook is represented in some sort of graph [with nodes -- people, organizations, places, brands, etc. -- and edges -- the relationships among those nodes]. We maintain the largest people-object graph in the world, and it's constantly changing, so it's not something you can handle in batch mode. The interactions are constant and you want the results to be fresh. We have to share in a way that lets us scale. All of these capabilities are behind the Graph Search product that we introduced in January.

If you're talking about graph analytics, there's an open-source project out there called Pregel that Google has written about. There's also the Apache Giraph project, which is more about graph analytics and graph processing.

We are also going to be talking about a project later this summer -- probably at the same time we talk about our real-time initiative -- that is a version of graph analytics that sits on top of our Hadoop infrastructure. There are some cool problems we've been able to solve by being able to process [Facebook's] large graph, infer data and make better suggestions to people, whether it be content or ads.
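For readers unfamiliar with the vertex-centric model associated with Pregel and Giraph, mentioned above, the sketch below shows the general idea in a single process: in each "superstep" a vertex reads its incoming messages, updates its own value, and sends messages to its neighbors, and the computation stops when no messages remain. This is a simplified illustration, not the Giraph API and not Facebook's implementation; the function `propagate_max` and the example graph are assumptions made for the example.

```python
def propagate_max(graph, values):
    """Toy single-process, vertex-centric propagation of the maximum value.

    graph:  dict mapping each vertex to a list of its out-neighbors
            (every neighbor must also appear as a key)
    values: dict mapping each vertex to its initial numeric value
    """
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    supersteps = 0
    while any(inbox.values()):
        supersteps += 1
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # a halted vertex stays idle until a message arrives
            best = max(messages)
            if best > values[vertex]:
                # The value changed, so tell the neighbors about it.
                values[vertex] = best
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(best)
        inbox = outbox
    return values, supersteps


# A small undirected chain a - b - c - d, written as two directed edges per link.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = propagate_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

In a real distributed system the vertices would be partitioned across many machines and the messages would cross the network between supersteps; the loop above only mimics that synchronization on one machine.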

IW: How and where do Facebook's current graph-analysis capabilities operate?

Parikh: A lot of the graph analytics are written and run on the Hive infrastructure. Hive's performance and scale issues make the overall latency of these analytics slower than we would like, and that's one of the reasons we've been investing in those other projects discussed earlier to speed things up and do things more efficiently.

There's another graph processing engine that we've written that sits between our Web tier and our storage tier. That has been around for a long time and it's the real-time engine that allows our website to generate the types of experiences that it does today. You can ask, "show me all my friends who like X," and it gives you a sorted and filtered list. It generates each and every page on the site. It's an area that's pretty ripe for innovation.
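The "show me all my friends who like X" query can be sketched against a small graph of nodes and typed edges, as described earlier in the interview. The code below is a minimal illustration only: the `SocialGraph` class, its method names, and the `"friend"` and `"likes"` edge types are hypothetical and say nothing about how Facebook's engine is actually built.

```python
from collections import defaultdict


class SocialGraph:
    """Toy graph keyed by (source node, edge type) -> set of target nodes."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, source, edge_type, target):
        self._edges[(source, edge_type)].add(target)

    def add_friendship(self, a, b):
        # Friendship is symmetric, so store it in both directions.
        self.add_edge(a, "friend", b)
        self.add_edge(b, "friend", a)

    def neighbors(self, source, edge_type):
        # .get avoids creating empty entries for unknown nodes.
        return self._edges.get((source, edge_type), set())

    def friends_who_like(self, user, thing):
        """Return the user's friends who like `thing`, sorted by name."""
        return sorted(
            friend
            for friend in self.neighbors(user, "friend")
            if thing in self.neighbors(friend, "likes")
        )


graph = SocialGraph()
for friend in ("bob", "carol", "dave"):
    graph.add_friendship("alice", friend)
graph.add_edge("bob", "likes", "hiking")
graph.add_edge("carol", "likes", "chess")
graph.add_edge("dave", "likes", "hiking")

hikers = graph.friends_who_like("alice", "hiking")
```

The query is a two-hop traversal: follow the user's `friend` edges, then keep only those friends with a `likes` edge to the target. Sorting here is alphabetical; a production system would presumably rank by something more meaningful.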

User Rank: Author
3/18/2013 | 2:25:03 PM
re: Facebook On Big Data Analytics: An Insider's View
Sounds like he has developed a team with a large amount of Hadoop expertise. I wonder if they are hiring up a storm from outside, or grooming people who were already there.

Laurianne McLaughlin
D. Henschen,
User Rank: Author
3/18/2013 | 4:55:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Parikh is pretty up front about the limitations of Hive that Facebook is trying to overcome, but he makes it clear it will take a yet-to-be-announced new platform -- expected this summer -- to address real-time analysis needs. Given the many real-time initiatives now underway in the Hadoop community, it will be interesting to see whether Facebook's new platform is embraced the way Hive was embraced way back when.
User Rank: Apprentice
3/26/2013 | 4:29:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Great article, Doug. Glad you brought the viewpoints of the true pioneers, adopters and practitioners for the benefit of the mainstream enterprise. It was very interesting to read how they push front end code twice a day for analysis and Scuba. Any reason why they did not go with established in-memory databases -- a technology that was already pretty mature when they adopted MySQL for other purposes?
User Rank: Apprentice
5/6/2013 | 12:55:42 PM
re: Facebook On Big Data Analytics: An Insider's View
I also think he has developed a huge team in order to manage big data at a larger scale.