Yahoo Talks Apache Storm: Real-Time Appeal
Yahoo is committed to Apache Storm, the open event-processing platform, because it's easy to manage, scales well, and supports uses such as personalization as a service, says Yahoo executive Sumeet Singh in a Q&A.
How will big-data insight evolve into real-time big-data insight? Yahoo is betting on Apache Storm, an event-processing platform that last month became a top-level project for the Apache Software Foundation.
Storm was invented at BackType and was then contributed to open source after that company was acquired by Twitter. Yahoo has been using Storm for two years, and development has evolved from experimentation to an integral part of the company's data-processing stack. Yahoo is now a major backer of Storm, with five engineers committed to the project -- more than any other company.
Why so high on Storm, and what can this platform do for enterprise-sized organizations? Sumeet Singh, senior director for product management, cloud, and big data platforms, explained how Storm is used at Yahoo and how he expects real-time capabilities to evolve.
InformationWeek: How did you get started with Storm, and what's the appeal?
Sumeet Singh: About two and a half years ago we saw a need to take latency out of our systems, and there were plenty of use cases. We had our own incubator project developed out of Yahoo Labs, but Storm was getting pretty popular by that time.
We also looked at several commercial solutions, but we were attracted to Storm for a variety of reasons, including the types of applications we had in mind and the types of applications that were possible to develop with Storm. We were attracted to the simplicity of managing the infrastructure. A lot of time goes into that, and Storm scored really well when it came to simplicity of managing large-scale clusters. We also wanted something that could scale seamlessly to handle application scale, infrastructure scale, and resource guarantees to individual applications. Storm does well on all of those fronts.
IW: Are you scaling out on Hadoop or on dedicated Storm clusters?
SS: We started off trying Storm on Hadoop, and we developed what we called Storm on YARN [the management layer added in Hadoop 2.0]. We were the first ones on YARN, and we had already rolled it out at scale in early 2011. But at that time, several things needed to happen to run Storm at scale on YARN. We did release a Storm-on-YARN prototype in Apache open-source, but to get our use cases into production quickly, we switched to an isolated cluster.
IW: What are Yahoo's most prominent use cases for Storm?
SS: We have more than 170 topologies in production, but some of the marquee use cases are for personalization as a platform. We use that both for the Yahoo home page and for international properties.
Figure 1: Sumeet Singh, Yahoo's senior director for product management, cloud, and big data platforms.
We're also offering personalization as a service to other websites outside of Yahoo. That product is called Yahoo Recommends, and some of our publishing partners use it on their websites.
IW: How does personalization work?
SS: We create a profile for every user that we've learned anything about. We use machine learning in the background to apply things that we learn about that user to cater to his or her interests and needs. If you're a sports fan, we try to show you the right kinds of sports content or ads. That happens through Hadoop, where we have historical behavior captured in user profiles, but it also happens through Storm, which we call a lower-latency, real-time service. It's the path where we're trying to infer what you're doing in the moment. The latest context is applied via Storm, so it essentially complements our Hadoop platform and models so we can score content properly.
IW: What's the latency of your Hadoop platform versus that of Storm?
SS: It depends on the application and the batching of data. In Hadoop we have 15-minute use cases, 30-minute use cases, one-hour use cases, and, slowest case, two-hour use cases from the time the event occurs on one of our 30,000 web servers worldwide until the time that event goes into creating a new audience feed. That's batch, and we're constantly in a battle to reduce that latency. The 15-minute use case is for advertising.
With Storm the idea is that there is no latency. There is some, of course, but we're talking about less-than-10-minute-type latency. Storm has many different paradigms that we apply to our business, and one of them is continuous computing. In this case we're continuously processing events from our web servers, and we apply that data immediately to the user profiles. It's not quite real time, but it's near-real time. The richness of the profile is far greater because we're combining the intelligence from these short bursts as well as the deeper insight from the historical data.
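To make the idea of combining short-burst signals with historical insight concrete, here is a minimal Python sketch. It is an illustration only, not Yahoo's scoring code: the function name `blended_score`, the per-topic scores, and the `recent_weight` parameter are all assumptions introduced for this example.

```python
def blended_score(historical, recent, recent_weight=0.5):
    """Blend long-term (batch) and in-session (streaming) interest scores.

    `historical` and `recent` map topic -> score in [0, 1].
    `recent_weight` is a hypothetical tuning knob, not a value from the interview.
    """
    topics = set(historical) | set(recent)
    return {
        topic: (1 - recent_weight) * historical.get(topic, 0.0)
        + recent_weight * recent.get(topic, 0.0)
        for topic in topics
    }


# Long-term profile, as a batch system might compute it (made-up numbers).
historical = {"sports": 0.8, "finance": 0.2}
# What the user is doing right now, as a streaming path might infer it.
recent = {"finance": 1.0}

scores = blended_score(historical, recent, recent_weight=0.5)
top_topic = max(scores, key=scores.get)
```

With these made-up numbers, the in-the-moment finance signal outweighs the long-term sports interest, which is the kind of shift a batch-only profile would not pick up until its next run.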
IW: Any other important uses of Storm?
SS: There are a lot of uses in advertising because that's where you get the most value out of reducing latency. Budgeting and reporting are big use cases across advertising so you can control the ads that you're serving against campaigns very closely. You don't want to over-deliver or under-deliver impressions, so we use Storm to process the ad-serving events and control the budgeting and reporting aspects of campaign management on almost all of our ad systems. That capability is now branded as Yahoo Ad Manager for traditional advertising and Yahoo Ad Manager Plus for programmatic buying.
Another interesting use case was recently developed for Flickr. If you've ever used Flickr to store and manage your pictures, you would notice that all of your Flickr pictures have tags. We use these tags in a variety of ways, such as recommending pictures to your social profile or your Flickr circle. One of the ways we're doing tagging is through deep learning behind the scenes, and those categorization and classification algorithms run on Storm. The moment the picture is uploaded, we're tagging it in real time and the tags are applied back to your pictures. We call that Flickr Auto Tagging, and it's applied and stored along with your images within about 1 second [Author's note: this latency figure was revised downward from "15 minutes" at the request of Sumeet Singh, who misstated the stat during our interview].
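The campaign budgeting use case described earlier in this answer can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Yahoo Ad Manager: the `BudgetTracker` class, its method names, and the impression-count budget are hypothetical.

```python
class BudgetTracker:
    """Counts served impressions per campaign against a fixed budget.

    Hypothetical helper for illustration; a production system would keep
    these counters in a shared store updated by the stream processor.
    """

    def __init__(self, budgets):
        self.budgets = dict(budgets)
        self.delivered = {campaign: 0 for campaign in budgets}

    def record_impression(self, campaign_id):
        # Called once per ad-serving event.
        self.delivered[campaign_id] += 1

    def can_serve(self, campaign_id):
        # Stop serving once the budget is reached, to avoid over-delivery.
        return self.delivered[campaign_id] < self.budgets[campaign_id]


tracker = BudgetTracker({"campaign_a": 3})
served = 0
for _ in range(5):  # five ad requests arrive
    if tracker.can_serve("campaign_a"):
        tracker.record_impression("campaign_a")
        served += 1
```

The lower the delay between an ad being served and the counter being updated, the smaller the window in which a campaign can over-deliver, which is why this workload benefits from stream processing.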
IW: Is Storm ready for smaller organizations that don't have Yahoo's deep engineering bench?
SS: Yahoo deals with problems with scale, security, and multi-tenancy. We were instrumental in moving Storm from a single-developer-type GitHub repository into a mainstream Apache project and, now, a top-level project.
Figure 2: Yahoo's efforts to advance Storm were recently detailed in a blog post on Tumblr.
Some of the security work that we've done has yet to be committed to open source, and I would say only about 20% of what we've done has made it into the latest release of Storm. But there are a lot of contributions on their way from us, and we're going to continue to harden the platform because enterprise requirements are also requirements for Yahoo. Scale is obviously important to us, but security is also very important for Yahoo because we have some sensitive data, such as email, and we're running on multi-tenant systems.
IW: To put it in perspective, what's the scale of your operation?
SS: On Storm, we recently crossed the 1,000-server threshold. In Hadoop, we have 32,500 servers across 16 clusters. We have roughly 300 applications, so if you don't make those clusters multi-tenant you would end up needing 300 clusters, which would be a management nightmare. The minute you bring everybody's data onto the same cluster, security becomes a natural concern. A lot of effort goes into making sure that the security works in such a way that it doesn't reduce productivity or access to services. At the same time, you need peace of mind that you have audit capabilities, authorization capabilities, and authentication capabilities.
IW: So, what's next for Storm?
SS: We never imagined that we would need to scale Storm to what we're trying to achieve right now. Today our largest Storm cluster is about 250 servers. Storm scales well, but there are limits. We're trying to move beyond those limits and get to thousands of servers. We call that Super Scalability, and we're working very actively on that. In a recent blog we also talked about what we're doing with Heartbeat Servers, distributed cache, and scheduling. Storm also doesn't store states, currently, so we rely on a separate NoSQL store to work in conjunction with Storm. In the majority of cases that's HBase, but we're trying to use Apache Kafka to merge the various data pipelines into the same system.
IW: What's a use case where state-awareness is crucial?
SS: Say you're incrementally building some form of intelligence. Personalization is a great example. You have someone's profile stored, and you add incremental bits of intelligence as you continuously process events in Storm. You need some place to store states so you have the latest snapshot that you can build models against and serve content or ads against. Right now we're storing those states in HBase, but that's separate from Storm. We're trying to bring the two together on the same node. That will speed the processing and throughput of the system.
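The pattern Singh describes, stateless event processing paired with a separate store that holds the latest profile snapshot, can be sketched as follows. This is a minimal illustration, not Yahoo's implementation: `ProfileStore` is a hypothetical in-memory stand-in for an external key-value store such as HBase, and the event fields are invented for the example.

```python
from collections import defaultdict


class ProfileStore:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(int))

    def increment(self, user_id, interest, amount=1):
        # Add an incremental bit of intelligence to the stored profile.
        self._rows[user_id][interest] += amount

    def snapshot(self, user_id):
        # Latest state, ready to build models or serve content against.
        return dict(self._rows[user_id])


def process_event(store, event):
    """Handle one event; all state lives in the store, not in the worker."""
    store.increment(event["user_id"], event["topic"])


store = ProfileStore()
events = [
    {"user_id": "u1", "topic": "sports"},
    {"user_id": "u1", "topic": "finance"},
    {"user_id": "u1", "topic": "sports"},
]
for event in events:
    process_event(store, event)

profile = store.snapshot("u1")
```

In a real deployment each `increment` would be a network call to the store, which is the overhead that keeping state on the same node as the processing is meant to reduce.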
IW: That's clearly in development. Where is Storm today in running on Hadoop and handling routine enterprise-scale work?
SS: The Storm users I know of today -- like Flipboard, Twitter, Alibaba, and Rocket Fuel -- are all Internet companies. I don't have a good example of an enterprise running Storm, but I see no reason why they couldn't use it. Nobody is running Storm multi-tenant other than Yahoo, because we did all that work, but there's a lot of movement on running Storm with YARN on Hadoop. As I said earlier, we did a Storm-on-YARN proof-of-concept two years ago, but we're still running Storm standalone. Hopefully, next year you'll start to see Storm running in production at scale on YARN.