Facebook On Big Data Analytics: An Insider's View - InformationWeek




Facebook's Jay Parikh talks about fixing Hive, real-time platforms and how traditional companies can 'thread the needle' of big data success.

Jay Parikh, Facebook
Few businesses are on the scale of Facebook, but the problems it's dealing with today might influence the best practices smaller companies will be putting in place tomorrow.

Just as Facebook is shaping big data hardware and data centers through its Open Compute Project initiative, it's also influencing the software tools and platforms for big data analysis, including Hadoop, Hive, graph analysis and more. Hive, Hadoop's data warehousing infrastructure, originated at Facebook, and according to Jay Parikh, VP of infrastructure engineering, the company is hard at work on ways to make Hive work faster and support more SQL query capabilities.

Parikh also tells InformationWeek that Facebook is working on new real-time and graph-analysis platforms, but the heart and soul of this interview is about big data analytics. There's plenty of detail on how Facebook answers operational and business questions, but read on to get Parikh's advice on how to avoid "wasting a lot of money" or "missing huge opportunities" in big data.

InformationWeek: The topic at hand is big data analytics, but let's start by exploring Facebook's infrastructure to get some context.

Jay Parikh: There are a few areas that we invest in to scale massive amounts of data. If you consider just the photos on Facebook, we have more than 250 billion photos on the site and we get 350 million new photos every day. It's a core, immersive experience for our users, so we've had to rethink and innovate at all levels of the stack, not just the software, to manage these files and to serve them, store them and make sure that they're available when users go back through their timeline to view them. That has meant changes at the hardware level, the network level and the data center level. It's a custom stack, and it doesn't involve Hadoop or Hive or any open source big data platforms.

Another area where we invest is in storing user actions. When you "like" something, post a status update or make a friend on Facebook, we use a very distributed, highly optimized, highly customized version of MySQL to store that data. We run the site, basically, storing all of our user action data in MySQL. That's the second pillar.


The third area is Hadoop infrastructure. We do a lot with Hadoop. It's used in every product and in many different ways. A few years ago we launched a new version of Facebook Messaging, for example, and it runs on top of HBase [the Hadoop NoSQL database framework]. All of the messages you send on mobile and desktop get persisted to HBase. We relied on our expertise in Hadoop and HDFS to scale HBase to store messages.

We also use a version of Hadoop and Hive to run the business, including a lot of our analytics around optimizing our products, generating reports for our third-party developers, who need to know how their applications are running on the site, and generating reports for advertisers, who need to know how their campaigns are doing. All of those analytics are driven off of Hadoop, HDFS, Hive and interfaces that we've developed for developers, internal data scientists, product managers and external advertisers.

IW: Any big changes afoot, particularly where analytic capabilities are concerned?

Parikh: There's lots of hype in the [IT] industry today about everything needing to be real time. That has been true for us for a long time. We push the front-end website code twice a day. We have thousands of different versions of the site running at any given moment. We launched Light Stand, a new version of our newsfeed, last week, and we launched Facebook Graph Search in January. As people are adopting new products like this, we need to understand whether they're working or not. Are people engaged? Are they missing key features? Are they still liking things as much? If the warehouse or analytics platform can't keep up, then we can't come up with new iterations of our products very quickly. Real-time measurement has been a key element for us.

User Rank: Apprentice
5/6/2013 | 12:55:42 PM
re: Facebook On Big Data Analytics: An Insider's View
I also thought he must have developed a huge team in order to maintain big data at such a large scale.
User Rank: Apprentice
3/26/2013 | 4:29:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Great article, Doug. Glad you brought the viewpoints of the true pioneers, adopters and practitioners to the mainstream enterprise. It was very interesting to read how they push front-end code twice a day for analysis and Scuba. Any reason why they did not go with established in-memory databases - a technology that was already pretty mature when they adopted MySQL for other purposes?
D. Henschen
User Rank: Author
3/18/2013 | 4:55:29 PM
re: Facebook On Big Data Analytics: An Insider's View
Parikh is pretty up front about the limitations of Hive that Facebook is trying to overcome, but he makes it clear it will take a yet-to-be-announced new platform -- expected this summer -- to address real-time analysis needs. Given the many real-time initiatives now underway in the Hadoop community, it will be interesting to see whether Facebook's new platform is embraced the way Hive was embraced way back when.
User Rank: Author
3/18/2013 | 2:25:03 PM
re: Facebook On Big Data Analytics: An Insider's View
Sounds like he has developed a team with a large amount of Hadoop expertise. I wonder if they are hiring up a storm from outside, or grooming people who were already there.

Laurianne McLaughlin