Yahoo Talks Apache Storm: Real-Time Appeal

Yahoo is committed to Apache Storm, the open event-processing platform, because it's easy to manage and scale and use for personalization as a service, among other uses, says Yahoo executive Sumeet Singh in a Q&A.

SS: There are different paradigms that we apply to our business, and one of them is continuous computing. In this case we're continuously processing events from our web servers, and we apply that data immediately to the user profiles. It's not quite real time, but it's near-real time. The richness of the profile is far greater because we're combining the intelligence from these short bursts with the deeper insight from the historical data.
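The pattern described here can be sketched in a few lines. This is an illustrative simulation, not Yahoo's actual code: the `historical` snapshot stands in for the batch-computed profile, and `on_event` plays the role of the continuous processing step that folds each new event into the profile immediately.

```python
from collections import Counter, defaultdict

# Historical interest weights, e.g. loaded from a nightly batch pipeline.
historical = {"u1": Counter({"sports": 5, "finance": 2})}

# Live profiles start from the historical snapshot and absorb each
# event as it arrives, rather than waiting for the next batch run.
profiles = defaultdict(Counter, {u: c.copy() for u, c in historical.items()})

def on_event(user_id, category):
    """Apply one web-server event to the user's profile immediately."""
    profiles[user_id][category] += 1

def top_interest(user_id):
    """Latest snapshot: historical depth plus the most recent bursts."""
    return profiles[user_id].most_common(1)[0][0]

# A short burst of recent activity shifts the profile in near-real time.
for evt in [("u1", "finance"), ("u1", "finance"),
            ("u1", "finance"), ("u1", "finance")]:
    on_event(*evt)

print(top_interest("u1"))  # finance now outweighs the historical sports signal
```

The point of the combination is visible in the last line: the historical weights give the profile depth, while the streamed events let it react within seconds instead of a day.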

IW: Any other important uses of Storm?

SS: There are a lot of uses in advertising because that's where you get the most value out of reducing latency. Budgeting and reporting are big use cases across advertising so you can control the ads that you're serving against campaigns very closely. You don't want to over-deliver or under-deliver impressions, so we use Storm to process the ad-serving events and control the budgeting and reporting aspects of campaign management on almost all of our ad systems. That capability is now branded as Yahoo Ad Manager for traditional advertising and Yahoo Ad Manager Plus for programmatic buying.
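The over-/under-delivery problem can be made concrete with a small sketch. This is a hypothetical illustration of the budgeting idea, not Yahoo Ad Manager internals: each ad-serving event decrements the campaign's remaining budget as soon as it arrives, so serving stops exactly when the budget is exhausted.

```python
class Campaign:
    """Tracks a campaign budget against streamed ad-serving events."""

    def __init__(self, budget, cost_per_impression):
        self.remaining = budget
        self.cpi = cost_per_impression
        self.impressions = 0

    def can_serve(self):
        """Only serve if another impression fits in the budget."""
        return self.remaining >= self.cpi

    def record_impression(self):
        """Process one ad-serving event the moment it arrives."""
        if not self.can_serve():
            raise RuntimeError("budget exhausted; would over-deliver")
        self.remaining -= self.cpi
        self.impressions += 1

camp = Campaign(budget=10.0, cost_per_impression=2.5)
while camp.can_serve():
    camp.record_impression()

print(camp.impressions)  # 4 impressions, exactly on budget
```

If these events were instead processed in hourly batches, every campaign could overshoot by up to an hour's worth of impressions before anyone noticed, which is why latency reduction pays off most in advertising.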

Another interesting use case was recently developed for Flickr. If you've ever used Flickr to store and manage your pictures, you'll have noticed that all of your Flickr pictures have tags. We use these tags in a variety of ways, such as recommending pictures to your social profile or your Flickr circle. One of the ways we're doing tagging is through deep learning behind the scenes, and those categorization and classification algorithms run on Storm. The moment the picture is uploaded, we're tagging it in real time and the tags are applied back to your pictures. We call that Flickr Auto Tagging, and it's applied and stored along with your images within about 1 second [Author's note: this latency figure was revised downward from "15 minutes" at the request of Sumeet Singh, who misstated the stat during our interview].
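The shape of that pipeline is simple to sketch. Everything below is a stand-in (the classifier, the store, the function names are all hypothetical): the point is that tagging happens as part of handling the upload event, not in a later batch pass.

```python
def classify(image_bytes):
    # Stand-in for the deep-learning classifier running on the stream;
    # a real system would run a trained model here.
    return ["beach", "sunset"]

tag_store = {}  # photo_id -> tags, stored alongside the image

def on_upload(photo_id, image_bytes):
    """Handle one upload event: tag immediately, write tags back."""
    tags = classify(image_bytes)
    tag_store[photo_id] = tags
    return tags

print(on_upload("p1", b"...jpeg bytes..."))  # tags available within the event
```

Because the classification runs inside the event-processing path, the tags are queryable as soon as the upload completes, which is what makes the roughly one-second end-to-end latency possible.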

IW: Is Storm ready for smaller organizations that don't have Yahoo's deep engineering bench?

SS: Yahoo deals with problems of scale, security, and multi-tenancy. We were instrumental in moving Storm from a single-developer-type GitHub repository into a mainstream Apache project and, now, a top-level project.

Yahoo's efforts to advance Storm were recently detailed in a blog post on Tumblr.

Some of the security work that we've done has yet to be committed to open source, and I would say only about 20% of what we've done has made it into the latest release of Storm. But there are a lot of contributions on their way from us, and we're going to continue to harden the platform because enterprise requirements are Yahoo's requirements, too. Scale is obviously important to us, but security is also very important for Yahoo because we have some sensitive data, such as email, and we're running on multi-tenant systems.

IW: To put it in perspective, what's the scale of your operation?

SS: On Storm, we recently crossed the 1,000-server threshold. In Hadoop, we have 32,500 servers across 16 clusters. We have roughly 300 applications, so if you don't make those clusters multi-tenant you would end up needing 300 clusters, which would be a management nightmare. The minute you bring everybody's data onto the same cluster, security becomes a natural concern. A lot of effort goes into making sure that the security works in such a way that it doesn't reduce productivity or access to services. At the same time, you need peace of mind that you have audit capabilities, authorization capabilities, and authentication capabilities.

IW: So, what's next for Storm?

SS: We never imagined that we would need to scale Storm to what we're trying to achieve right now. Today our largest Storm cluster is about 250 servers. Storm scales well, but there are limits. We're trying to move beyond those limits and get to thousands of servers. We call that Super Scalability, and we're working very actively on that. In a recent blog we also talked about what we're doing with Heartbeat Servers, distributed cache, and scheduling. Storm also doesn't store state currently, so we rely on a separate NoSQL store to work in conjunction with Storm. In the majority of cases that's HBase, but we're trying to use Apache Kafka to merge the various data pipelines into the same system.

IW: What's a use case where state-awareness is crucial?

SS: Say you're incrementally building some form of intelligence. Personalization is a great example. You have someone's profile stored, and you add incremental bits of intelligence as you continuously process events in Storm. You need some place to store state so you have the latest snapshot that you can build models against and serve content or ads against. Right now we're storing that state in HBase, but that's separate from Storm. We're trying to bring the two together on the same node. That will speed the processing and throughput of the system.
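The read-modify-write pattern Singh describes looks roughly like this. In this sketch a plain dictionary stands in for the external NoSQL store (HBase in Yahoo's case), and `process_event` plays the role of a stateless Storm processing step:

```python
state_store = {}  # stand-in for the external NoSQL store (e.g. HBase)

def process_event(user_id, signal, weight=1.0):
    """One stateless processing step: read the latest snapshot from the
    external store, apply the incremental bit of intelligence, and
    write the snapshot back."""
    snapshot = state_store.get(user_id, {})
    snapshot[signal] = snapshot.get(signal, 0.0) + weight
    state_store[user_id] = snapshot  # write back the latest snapshot
    return snapshot

process_event("u7", "autos")
process_event("u7", "autos")
snap = process_event("u7", "travel")
print(snap)  # {'autos': 2.0, 'travel': 1.0}
```

Every event pays for a round trip to the separate store, which is exactly the cost that co-locating the state with the processing node is meant to eliminate.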

IW: That's clearly in development. Where is Storm today in running on Hadoop and handling routine enterprise-scale work?

SS: The Storm users I know of today -- like Flipboard, Twitter, Alibaba, and Rocket Fuel -- are all Internet companies. I don't have a good example of an enterprise running Storm, but I see no reason why they couldn't use it. Nobody is running Storm multi-tenant other than Yahoo, because we did all that work, but there's a lot of movement on running Storm with YARN on Hadoop. As I said earlier, we did a Storm-on-YARN proof-of-concept two years ago, but we're still running Storm standalone. Hopefully, next year you'll start to see Storm running in production at scale on YARN.

