Facebook On Big Data Analytics: An Insider's View

Doug Henschen
Executive Editor, InformationWeek

Facebook's Jay Parikh talks about fixing Hive, real-time platforms and how traditional companies can 'thread the needle' of big data success.

Jay Parikh, Facebook
Few businesses are on the scale of Facebook, but the problems it's dealing with today might influence the best practices smaller companies will be putting in place tomorrow.

Just as Facebook is shaping big data hardware and data centers through its Open Compute Project initiative, it's also influencing the software tools and platforms for big data analysis, including Hadoop, Hive, graph analysis and more. Hive, Hadoop's data warehousing infrastructure, originated at Facebook, and according to Jay Parikh, VP of infrastructure engineering, the company is hard at work on ways to make Hive work faster and support more SQL query capabilities.


Parikh also tells InformationWeek that Facebook is working on new real-time and graph-analysis platforms, but the heart of this interview is big data analytics. There's plenty of detail on how Facebook answers operational and business questions, but read on to get Parikh's advice on how to avoid "wasting a lot of money" or "missing huge opportunities" in big data.

InformationWeek: The topic at hand is big data analytics, but let's start by exploring Facebook's infrastructure to get some context.

Jay Parikh: There are a few areas that we invest in to scale massive amounts of data. If you consider just the photos on Facebook, we have more than 250 billion photos on the site and we get 350 million new photos every day. It's a core, immersive experience for our users, so we've had to rethink and innovate at all levels of the stack, not just the software, to manage these files and to serve them, store them and make sure that they're available when users go back through their timeline to view them. That has meant changes at the hardware level, the network level and the data center level. It's a custom stack, and it doesn't involve Hadoop or Hive or any open source big data platforms.

Another area where we invest is in storing user actions. When you "like" something, post a status update or make a friend on Facebook, we use a very distributed, highly optimized, highly customized version of MySQL to store that data. We run the site, basically, storing all of our user action data in MySQL. That's the second pillar.

The third area is Hadoop infrastructure. We do a lot with Hadoop. It's used in every product and in many different ways. A few years ago we launched a new version of Facebook Messaging, for example, and it runs on top of HBase [the Hadoop NoSQL database framework]. All of the messages you send on mobile and desktop get persisted to HBase. We relied on our expertise in Hadoop and HDFS to scale HBase to store messages.

We also use a version of Hadoop and Hive to run the business, including a lot of our analytics around optimizing our products, generating reports for our third-party developers, who need to know how their applications are running on the site, and generating reports for advertisers, who need to know how their campaigns are doing. All of those analytics are driven off of Hadoop, HDFS, Hive and interfaces that we've developed for developers, internal data scientists, product managers and external advertisers.

IW: Any big changes afoot, particularly where analytic capabilities are concerned?

Parikh: There's lots of hype in the [IT] industry today about everything needing to be real time. That has been true for us for a long time. We push the front-end website code twice a day. We have thousands of different versions of the site running at any given moment. We launched Light Stand, a new version of our newsfeed, last week, and we launched Facebook Graph Search in January. As people are adopting new products like this, we need to understand whether they're working or not. Are people engaged? Are they missing key features? Are they still liking things as much? If the warehouse or analytics platform can't keep up, then we can't come up with new iterations of our products very quickly. Real-time measurement has been a key element for us.
