3/15/2013
03:19 PM

Facebook On Big Data Analytics: An Insider's View

Facebook's Jay Parikh talks about fixing Hive, real-time platforms and how traditional companies can 'thread the needle' of big data success.

IW: How are you addressing real-time analytics?

Parikh: A couple of ways. We use Hive [data warehousing infrastructure] to run lots of reports. Hive is something we developed and open sourced, and it runs on top of the Hadoop stack. We also have a system called Scuba, which is a real-time system for analytics. Scuba stores everything in memory so it's really fast, and you can do all sorts of transformations and drill-downs on the data. We use it both for operations data -- site performance metrics, reliability metrics and so on -- and for business data, studying the effectiveness of the advertising system or ranking systems.
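
To make the drill-down idea concrete, here is a minimal sketch of filtering and grouping rows held entirely in memory. It is not Scuba's interface, which the interview does not describe; the row layout, the field names and the `drill_down` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory event rows; field names are illustrative only.
EVENTS = [
    {"metric": "page_load_ms", "region": "us", "page": "home", "value": 120},
    {"metric": "page_load_ms", "region": "us", "page": "profile", "value": 200},
    {"metric": "page_load_ms", "region": "eu", "page": "home", "value": 180},
    {"metric": "error_count", "region": "us", "page": "home", "value": 3},
]


def drill_down(rows, filters, group_by):
    """Average `value` per group for the rows that match every filter."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for row in rows:
        if all(row.get(key) == wanted for key, wanted in filters.items()):
            bucket = totals[row[group_by]]
            bucket[0] += row["value"]
            bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Start broad, then narrow the same data by adding a filter.
by_region = drill_down(EVENTS, {"metric": "page_load_ms"}, "region")
by_page_us = drill_down(EVENTS, {"metric": "page_load_ms", "region": "us"}, "page")
```

Because every row is already in memory, each narrower question is just another pass over the same list, which is roughly why an in-memory design can answer ad hoc questions quickly.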

We're working on a couple of other things including a new platform that will allow us to query the data in our Hadoop infrastructure much more rapidly. We're building that out, and we're probably going to talk about it this summer.

IW: Hive's lack of speed is well known. So will this new platform solve that problem?

Parikh: We have a number of efforts on real time. Scuba is one but we're also working on Hive extensively, and we're in the process of pushing our contributions back into the open source version. You'll see, over the course of the coming months, some very significant changes that we're going to push into the community to make Hive faster.


Hive is still a workhorse and it will remain the workhorse for a long time because it's easy and it scales. Easy is the key thing when you want lots of people to be able to engage with a tool. Hive is very simple to use, so we've been focused on performance to make it even more effective.

IW: Is it all about speed, or are you also working on broader SQL-query capabilities?

Parikh: We're working on both. We're filling some of the gaps in what it can do SQL-wise, and we're also working on performance and reliability. There's also this new, unannounced platform that we'll be talking about later this summer that will sit next to both Scuba and Hive. Everything about it is real time, and it will cut down the latency [of Hadoop] significantly.

IW: What about graph analysis? That would seem to be a Facebook specialty since it's about understanding network relationships.

Parikh: Everything in Facebook is represented in some sort of graph [with nodes -- people, organizations, places, brands, etc. -- and edges -- the relationships among those nodes]. We maintain the largest people-object graph in the world, and it's constantly changing, so it's not something you can handle in batch mode. The interactions are constant and you want the results to be fresh. We have to share in a way that lets us scale. All of these capabilities are behind the Graph Search product that we introduced in January.

If you're talking about graph analytics, there's an open-source project out there called Pregel that Google has written about [Pregel is a Google system described in a published paper; it is not itself open source]. There's also the Apache Giraph project, which is more about graph analytics and graph processing.
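
As a rough illustration of the vertex-centric style of graph processing associated with Pregel and Giraph, the sketch below labels connected components by having each vertex pass its smallest known label to its neighbors in rounds ("supersteps"). It is a single-process toy, not Giraph's API; the function name and the sample graph are hypothetical, and the adjacency lists are assumed to be symmetric.

```python
def connected_components(adjacency):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {vertex: vertex for vertex in adjacency}
    # Superstep 0: every vertex is active and will message its neighbors.
    active = set(adjacency)
    while active:
        # Send phase: active vertices send their current label to neighbors.
        inbox = {}
        for vertex in active:
            for neighbor in adjacency[vertex]:
                inbox.setdefault(neighbor, []).append(labels[vertex])
        # Receive phase: a vertex stays active only if a message lowers its label.
        active = set()
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < labels[vertex]:
                labels[vertex] = best
                active.add(vertex)
    return labels


# Hypothetical undirected graph with two components: {1, 2, 3} and {4, 5}.
GRAPH = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = connected_components(GRAPH)
```

The loop ends when no vertex changes its label in a round, which mirrors the idea of a computation halting once every vertex has gone quiet.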

We are also going to be talking about a project later this summer -- probably at the same time we talk about our real-time initiative -- that is a version of graph analytics that sits on top of our Hadoop infrastructure. There are some cool problems we've been able to solve by being able to process [Facebook's] large graph, infer data and make better suggestions to people, whether it be content or ads.

IW: How and where do Facebook's current graph-analysis capabilities operate?

Parikh: A lot of the graph analytics are written and run on the Hive infrastructure. Hive's performance and scale issues make the overall latency of these analytics slower than we would like, and that's one of the reasons we've been investing in those other projects discussed earlier to speed things up and do things more efficiently.

There's another graph processing engine that we've written that sits between our Web tier and our storage tier. That has been around for a long time, and it's the real-time engine that allows our website to generate the types of experiences that it does today. You can ask, "show me all my friends who like X," and it gives you a sorted and filtered list. It generates each and every page on the site. It's an area that's pretty ripe for innovation.
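
The "friends who like X" request can be sketched as a two-hop lookup over typed edges. This is only an illustration of the query shape, not Facebook's engine: the edge store, the names in it and the `friends_who_like` helper are hypothetical, and alphabetical order stands in for whatever ranking the real system applies.

```python
# Hypothetical edge store: (source node, edge type) -> set of target nodes.
EDGES = {
    ("alice", "friend"): {"bob", "carol", "dave"},
    ("bob", "likes"): {"hiking", "jazz"},
    ("carol", "likes"): {"jazz"},
    ("dave", "likes"): {"hiking"},
}


def friends_who_like(user, thing, edges=EDGES):
    """Return the user's friends that have a 'likes' edge to `thing`, sorted."""
    friends = edges.get((user, "friend"), set())
    matches = [f for f in friends if thing in edges.get((f, "likes"), set())]
    return sorted(matches)  # alphabetical order as a stand-in for ranking


result = friends_who_like("alice", "hiking")
```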

Comments
TomMcGrath,
User Rank: Apprentice
5/6/2013 | 12:55:42 PM
re: Facebook On Big Data Analytics: An Insider's View
I also thought he has developed a huge team in order to maintain big data at larger extent.
http://www.bigdatacompanies.co...
PThakuria,
User Rank: Apprentice
3/26/2013 | 4:29:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Great article, Doug. Glad you brought the viewpoints of the true pioneers, adopters and practitioners to the mainstream enterprise. It was very interesting to read how they push front-end code twice a day for analysis and Scuba. Any reason why they did not go with established in-memory databases, a technology which was pretty mature when they adopted MySQL for other purposes?
D. Henschen,
User Rank: Author
3/18/2013 | 4:55:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Parikh is pretty up front about the limitations of Hive that Facebook is trying to overcome, but he makes it clear it will take a yet-to-be-announced new platform -- expected this summer -- to address real-time analysis needs. Given the many real-time initiatives now underway in the Hadoop community, it will be interesting to see whether Facebook's new platform is embraced the way Hive was embraced way back when.
Laurianne,
User Rank: Author
3/18/2013 | 2:25:03 PM
re: Facebook On Big Data Analytics: An Insider's View
Sounds like he has developed a team with a large amount of Hadoop expertise. I wonder if they are hiring up a storm from outside, or grooming people who were already there.

Laurianne McLaughlin
InformationWeek