Yahoo Talks Apache Storm: Real-Time Appeal - InformationWeek


10/22/2014
09:06 AM

Yahoo Talks Apache Storm: Real-Time Appeal

Yahoo is committed to Apache Storm, the open event-processing platform, because it's easy to manage and scale and use for personalization as a service, among other uses, says Yahoo executive Sumeet Singh in a Q&A.

How will big-data insight evolve into real-time big-data insight? Yahoo is betting on Apache Storm, an event-processing platform that last month became a top-level project for the Apache Software Foundation.

Storm was invented at BackType and was then contributed to open source after that company was acquired by Twitter. Yahoo has been using Storm for two years, and development has evolved from experimentation to an integral part of the company's data-processing stack. Yahoo is now a major backer of Storm, with five engineers committed to the project -- more than any other company.

Why so high on Storm, and what can this platform do for enterprise-sized organizations? Sumeet Singh, senior director for product management, cloud, and big data platforms, explained how Storm is used at Yahoo and how he expects real-time capabilities to evolve.

InformationWeek: How did you get started with Storm, and what's the appeal?

Sumeet Singh: About two and a half years ago we saw a need to take latency out of our systems, and there were plenty of use cases. We had our own incubator project developed out of Yahoo Labs, but Storm was getting pretty popular by that time.

We also looked at several commercial solutions, but we were attracted to Storm for a variety of reasons, including the types of applications we had in mind and the types of applications that were possible to develop with Storm. We were attracted to the simplicity of managing the infrastructure. A lot of time goes into that, and Storm scored really well when it came to simplicity of managing large-scale clusters. We also wanted something that could scale seamlessly to handle application scale, infrastructure scale, and resource guarantees to individual applications. Storm does well on all of those fronts.

IW: Are you scaling out on Hadoop or on dedicated Storm clusters?

SS: We started off trying Storm on Hadoop, and we developed what we called Storm on YARN [the management layer added in Hadoop 2.0]. We were the first ones on YARN, and we had already rolled it out at scale in early 2011. But at that time, several things needed to happen to run Storm at scale on YARN. We did release a Storm-on-YARN prototype in Apache open-source, but to get our use cases into production quickly, we switched to an isolated cluster.

IW: What are Yahoo's most prominent use cases for Storm?

SS: We have more than 170 topologies in production, but some of the marquee use cases are for personalization as a platform. We use that both for the Yahoo home page and for international properties.

Sumeet Singh, Yahoo's senior director for product management, cloud, and big data platforms.

We're also offering personalization as a service to other websites outside of Yahoo. That product is called Yahoo Recommends, and some of our publishing partners use it on their websites.

IW: How does personalization work?

SS: We create a profile for every user that we've learned anything about. We use machine learning in the background to apply things that we learn about that user to cater to his or her interests and needs. If you're a sports fan, we try to show you the right kinds of sports content or ads. That happens through Hadoop, where we have historical behavior captured in user profiles, but it also happens through Storm, which we call a lower-latency, real-time service. It's the path where we're trying to infer what you're doing in the moment. The latest context is applied via Storm, so it essentially complements our Hadoop platform and models so we can score content properly.
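As an illustration of the two paths Singh describes, here is a minimal Python sketch of how a long-term profile (the batch path) might be blended with in-the-moment signals (the real-time path) before content is scored. It is not Yahoo's implementation: the `UserProfile` class, the function names, the additive `recency_weight`, and all sample data are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Long-term interest weights, as a batch job might produce them (hypothetical)."""
    interests: dict = field(default_factory=dict)


def blend_interests(profile, recent_topics, recency_weight=0.5):
    """Add a fixed boost for every topic seen in the current session.

    The additive boost is an assumption for illustration, not a known model.
    """
    blended = dict(profile.interests)
    for topic in recent_topics:
        blended[topic] = blended.get(topic, 0.0) + recency_weight
    return blended


def score_content(item_topics, interests):
    """Score an item as the sum of the user's weights for the item's topics."""
    return sum(interests.get(topic, 0.0) for topic in item_topics)


def rank(items, profile, recent_topics):
    """Return item names ordered from highest to lowest blended score."""
    interests = blend_interests(profile, recent_topics)
    return sorted(
        items,
        key=lambda name: score_content(items[name], interests),
        reverse=True,
    )


# Hypothetical sample data: a sports fan who is reading finance stories right now.
profile = UserProfile(interests={"sports": 0.8, "finance": 0.3})
items = {
    "game recap": ["sports"],
    "market update": ["finance"],
    "celebrity news": ["gossip"],
}

print(rank(items, profile, []))
print(rank(items, profile, ["finance", "finance"]))
```

With no session activity the ranking follows the historical profile alone. Once two finance clicks arrive in the session, the finance story moves ahead of the sports story, which is the kind of "latest context" adjustment the real-time path is meant to provide.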

IW: What's the latency of your Hadoop platform versus that of Storm?

SS: It depends on the application and the batching of data. In Hadoop we have 15-minute use cases, 30-minute use cases, one-hour use cases, and, slowest case, two-hour use cases from the time the event occurs on one of our 30,000 web servers worldwide until the time that event goes into creating a new audience feed. That's batch, and we're constantly in a battle to reduce that latency. The 15-minute use case is for advertising.

With Storm the idea is that there is no latency. There is some, of course, but we're talking about less-than-10-minute-type latency. Storm has many

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Technocrati,
User Rank: Ninja
10/26/2014 | 9:30:47 PM
Re: How the mighty have fallen

@asksqn I remember as well, and it is interesting you don't think they will ultimately make it. Do you think they will eventually sell? Who would buy them?

Even though their homepage is filled with useless articles on the latest gossip, it still has a little more pizzazz than Google has at the moment. Could Google erase this advantage if it wanted to? Of course, but they have yet to do it.

Why not? With all of the resources they have at their disposal, what is the problem with employing what keeps Yahoo afloat?

I have been and will continue to be a major critic of Yahoo, but somehow they will continue to keep the lights on.

asksqn,
User Rank: Ninja
10/26/2014 | 4:37:25 PM
How the mighty have fallen
I remember when Yahoo was the gold standard for email and web content. Now it is barely hanging on, desperately trying to stay relevant. Embracing open source such as Storm was a prescient move, but it makes me wonder how much longer Yahoo can possibly survive given that Google is overtaking the world.
Technocrati,
User Rank: Ninja
10/23/2014 | 10:17:20 PM
Re: Who Knew? Yahoo?

@Doug It certainly appears engineering is a strong area for them, as it must be, of course. It is fascinating to me how these guys and girls keep chugging away while it appears that the other parts of the equation are not holding up their part of the bargain.

To be fair, the new CEO has made some strides, and Yahoo is making the effort for the most part.

I am just no fan of Yahoo; however, I do respect effort and above-average engineering.

D. Henschen,
User Rank: Author
10/23/2014 | 9:29:23 PM
Re: Who Knew? Yahoo?
Yahoo has had good technical people behind the scenes for a long time, but that hasn't overcome and won't necessarily fix the business-model problem.
Technocrati,
User Rank: Ninja
10/23/2014 | 7:27:10 PM
Who Knew? Yahoo?

Thanks, Doug, for an excellent interview with Yahoo's engineer. There are many (I am not one of them) who think Yahoo is either going to make a comeback (depending on whether you think they were ever there to begin with) or gain some ground on Google and the like.

When the topic comes up, I often ask Yahoo supporters how exactly they are going to do this. Some argue a changed interface, while others think signing Katie Couric would somehow prove Yahoo was moving in a new and improved direction.

Well, there has to be more to it; however, Yahoo seems to be pretty "tight-lipped" about its methods and intentions. So I was quite surprised to learn of Yahoo's use and support of Apache Storm.

Now this is something with teeth. You have provided a really good look behind the scenes. Not sure if this is the answer for Yahoo, but it certainly cannot hurt.

D. Henschen,
User Rank: Author
10/22/2014 | 3:18:28 PM
"Near-real-time" defined
Event-processing systems often promise sub-second performance, so I wanted to get a better sense of what mushy terms like "near-real-time" and "immediate updates" really mean. I asked Sumeet Singh a follow-up question. That led to a revision to the story, as Sumeet said in our original interview that Flickr Auto-Tagging happens within "15 minutes." That image-recognition-and-tagging feature, which runs on Storm, actually only takes about 1 second, he said. Here's what else he had to say about latency:

http://storm.apache.org talks about a benchmark of millions of tuples processed per second, per node. While we do not measure tuples per second, I can anecdotally tell you that we are easily processing over 500,000 events per second with Apache Storm on our clusters. In a production setup, you are talking about latencies in collecting the events worldwide, sending them to a Storm cluster in a particular datacenter or multiple datacenters, processing those events, making sense of them, and eventually serving or applying the results back in the business. That can take a second to a few seconds or minutes, but not all of that time is spent in Storm processing itself. We are still talking seconds to minutes, but that is end-to-end latency, not just stream processing. A lot depends on the use case itself.
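
To make the distinction between end-to-end latency and stream-processing latency concrete, here is a small Python sketch of a latency budget. Every figure in it is a made-up placeholder, not a measurement from Yahoo or from Storm, and the stage names are only a paraphrase of the steps Singh lists.

```python
# Hypothetical per-stage latencies in milliseconds; none of these numbers
# come from Yahoo or from any Storm benchmark.
STAGES_MS = {
    "collect events from web servers": 400,
    "send events to the Storm cluster": 250,
    "stream processing in Storm": 50,
    "serve the result back to the product": 300,
}


def end_to_end_ms(stages):
    """End-to-end latency is the sum of every stage, not just stream processing."""
    return sum(stages.values())


def share_of_total(stages, stage):
    """Fraction of the end-to-end latency spent in one named stage."""
    return stages[stage] / end_to_end_ms(stages)


total = end_to_end_ms(STAGES_MS)
storm_share = share_of_total(STAGES_MS, "stream processing in Storm")
print(f"end-to-end: {total} ms, Storm share: {storm_share:.0%}")
```

Under these assumed numbers the event takes one second end to end, yet only a small fraction of that second is spent inside the stream processor, which is the point of the distinction above.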

 
