Facebook On Big Data Analytics: An Insider's ViewFacebook On Big Data Analytics: An Insider's View
Facebook's Jay Parikh talks about fixing Hive, real-time platforms and how traditional companies can 'thread the needle' of big data success.
March 15, 2013
Jay Parikh, FacebookJay Parikh, Facebook
Few businesses are on the scale of Facebook, but the problems it's dealing with today might influence the best practices smaller companies will be putting in place tomorrow.
Just as Facebook is shaping big data hardware and data centers through its Open Compute Project initiative, it's also influencing the software tools and platforms for big data analysis, including Hadoop, Hive, graph analysis and more. Hive, Hadoop's data warehousing infrastructure, originated at Facebook, and according to Jay Parikh, VP of infrastructure engineering, the company is hard at work on ways to make Hive work faster and support more SQL query capabilities.
Parikh also tells InformationWeek that Facebook is working on new real-time and graph-analysis platforms, but the heart and soul of this interview is about big data analytics. There's plenty of detail on how Facebook answers operational and business questions, but read on to get Parikh's advice on how to avoid "wasting a lot of money" or "missing huge opportunities" in big data.
InformationWeek: The topic at hand is big data analytics, but let's start by exploring Facebook's infrastructure to get some context.
Jay Parikh: There are a few areas that we invest in to scale massive amounts of data. If you consider just the photos on Facebook, we have more than 250 billion photos on the site and we get 350 million new photos every day. It's a core, immersive experience for our users, so we've had to rethink and innovate at all levels of the stack, not just the software, to manage these files and to serve them, store them and make sure that they're available when users go back through their timeline to view them. That has meant changes at the hardware level, the network level and the data center level. It's a custom stack, and it doesn't involve Hadoop or Hive or any open source big data platforms.
Another area where we invest is in storing user actions. When you "like" something, post a status update or make a friend on Facebook, we use a very distributed, highly optimized, highly customized version of MySQL to store that data. We run the site, basically, storing all of our user action data in MySQL. That's the second pillar.
[ Want more insider info on Facebook? Read Facebook's Data Center: Where Likes Live. ]
The third area is Hadoop infrastructure. We do a lot with Hadoop. It's used in every product and in many different ways. A few years ago we launched a new version of Facebook Messaging, for example, and it runs on top of HBase [the Hadoop NoSQL database framework]. All of the messages you send on mobile and desktop get persisted to HBase. We relied on our expertise in Hadoop and HDFS to scale HBase to store messages.
We also use a version of Hadoop and Hive to run the business, including a lot of our analytics around optimizing our products, generating reports for our third-party developers, who need to know how their applications are running on the site, and generating reports for advertisers, who need to know how their campaigns are doing. All of those analytics are driven off of Hadoop, HDFS, Hive and interfaces that we've developed for developers, internal data scientists, product managers and external advertisers.
IW: Any big changes afoot, particularly where analytic capabilities are concerned?
Parikh: There's lots of hype in the [IT] industry today about everything needing to be real time. That has been true for us for a long time. We push the front-end website code twice a day. We have thousands of different versions of the site running at any given moment. We launched Light Stand, a new version of our newsfeed, last week, and we launched Facebook Graph Search in January. As people are adopting new products like this, we need to understand whether they're working or not. Are people engaged? Are they missing key features? Are they still liking things as much? If the warehouse or analytics platform can't keep up, then we can't come up with new iterations of our products very quickly. Real-time measurement has been a key element for us.
Facebook's Futuristic Data Center: Inside Tour
Facebook's Futuristic Data Center: Inside Tour(click image for larger view and for slideshow)
IW: How are you addressing real-time analytics?
Parikh: A couple of ways. We use Hive [data warehousing infrastructure] to run lots of reports. Hive is something we developed and open sourced, and it runs on top of the Hadoop stack. We also have a system called Scuba, which is a real-time system for analytics. Scuba stores everything in memory so it's really fast, and you can do all sorts of transformations and drill-downs on the data. We use it both for operations data -- site performance metrics, reliability metrics and so on -- and for business data, studying the effectiveness of the advertising system or ranking systems.
We're working on a couple of other things including a new platform that will allow us to query the data in our Hadoop infrastructure much more rapidly. We're building that out, and we're probably going to talk about it this summer.
IW: Hive's lack of speed is well known. So will this new platform solve that problem?
Parikh: We have a number of efforts on real time. Scuba is one but we're also working on Hive extensively, and we're in the process of pushing our contributions back into the open source version. You'll see, over the course of the coming months, some very significant changes that we're going to push into the community to make Hive faster.
[ Want more on big data hardware? Read Facebook Open Compute Project Shapes Big Data Hardware. ]
Hive is still a workhorse and it will remain the workhorse for a long time because it's easy and it scales. Easy is the key thing when you want lots of people to be able to engage with a tool. Hive is very simple to use, so we've been focused on performance to make it even more effective.
IW: Is it all about speed, or are you also working on broader SQL-query capabilities?
Parikh: We're working on both. We're filling some of the gaps in what it can do SQL wise, and we're also working on performance and reliability. There's also this new, unannounced platform that we'll be talking about later this summer that will sit next to both Scuba and Hive. Everything about it is real time, and it will cut down the latency [of Hadoop] significantly.
IW: What about graph analysis? That would seem to be a Facebook specialty since it's about understanding network relationships.
Parikh: Everything in Facebook is represented in some sort of graph [with nodes -- people, organizations, places, brands, etc. -- and edges -- the relationships among those nodes]. We maintain the largest people-object graph in the world, and it's constantly changing, so it's not something you can handle in batch mode. The interactions are constant and you want the results to be fresh. We have to share in a way that lets us scale. All of these capabilities are behind the Graph Search product that we introduced in January.
If you're talking about graph analytics, there's an open-source project out there called Pregel that Google has written about. There's also the Apache Giraph project, which is more about graph analytics and graph processing.
We are also going to be talking about a project later this summer -- probably at the same time we talk about our real-time initiative -- that is a version of graph analytics that sits on top of our Hadoop infrastructure. There are some cool problems we've been able to solve by being able to process [Facebook's] large graph, infer data and make better suggestions to people, whether it be content or ads.
IW: How and where do Facebook's current graph-analysis capabilities operate?
Parikh: A lot of the graph analytics are written and run on the Hive infrastructure. Hive's performance and scale issues make the overall latency of these analytics slower than we would like, and that's one of the reasons we've been investing in those other projects discussed earlier to speed things up and do things more efficiently.
There's another graph processing engine that we've written that sits between our Web tier and our storage tier. That has been around for a long time and it's the real-time engine allows our website to generate the types of experiences that it does today. You can ask, "show me all my friends who like X," and it gives you a sorted and filtered list. It generates each and every page on the site. It's an area that's pretty ripe for innovation.
13 Big Data Vendors To Watch In 2013
13 Big Data Vendors To Watch In 2013 (click image for larger view and for slideshow)
IW: Are there graph-analysis possibilities for ordinary companies and is the technology very mature?
Parikh: Graphs are not new, but there are definitely more technologies available now in terms of commercial and open-source graph databases. It's yet another cool piece of technology that lets you derive insight, but it's not going to supplant enterprise applications like fraud-detection or e-commerce that already highly optimized on relational databases.
The ecosystem around graph technology is very under-developed, and I don't think it will ever become as developed as the relational world because it's not general purpose. Graphs will develop, but it's going to be just yet another piece of technology that lets companies carve off and optimize a few key applications.
IW: Do you have any advice for enterprise IT shops venturing into big data?
Parikh: You're going to have the big-data Hadoop-Hive world, and then you're going to have some specialized real-time systems and you're going to have some specialized graph processing engines. Most IT shops, if they're good and they have a lot of applications to deal with, are going to end up in this world.
Everybody is dealing with scale today, and it's getting to be a more difficult challenge in terms of the amount of data that people want to collect and analyze. Sometimes companies are collecting data and they don't know what to do with it yet, or they're collecting data that they don't even know they have. The fundamental problems are how do you store it, how do you process it and how do you derive useful insights? If you aren't careful as you build out big data applications, you stand to waste a lot of money or you stand to miss huge opportunities in your business. Threading that needle is what every tech company in the world has to do, and most companies won't be able to do it well.
IW: Why not?
Parikh: It's very hard to manage the balance between storing too much and then trying to find something valuable or partitioning your data among different business units and not being able to get insight across the business. We're in an early phase of this technology. It's not something that's insurmountable and people are figuring it out. But storing the data, determining what you do with it, writing the applications and responding to the insight from the data is the balancing act that every tech organization is going to work on.
IW: The "wasting a lot of money" danger is pretty clear -- too much data, too little value. Any advice on how not to miss the opportunity?
Parikh: It's crucial to understand the data that you're collecting and to react to it to change your business. If you're just focused on the tip of the data, you may be missing a longer-term trend. You might be fixated on just a couple of bits of data and not looking at other bits that might be significant. You need a micro, laser focus on impact, but you also need to have a broad perspective on where you're going with all the data.
You may be focused on decisions with real-time data, but are you missing a longer-term impact on your business if you're not looking at your entire data set? It takes a lot of iteration and experimentation to succeed. It's an exciting time and there are lots of cool things for enterprises to try, but it's hard work and the technologies are still maturing.
The Enterprise Connect conference program covers the full range of platforms, services and applications that comprise modern communications and collaboration systems. Hear case studies from senior enterprise executives, as well as from the leaders of major industry players like Cisco, Microsoft, Avaya, Google and more. Register for Enterprise Connect 2013 today with code IWKPREM to save $200 off a conference pass or get a free Expo Pass. It happens March 18-21 in Orlando, Fla.
About the Author(s)
You May Also Like