How Netflix, Zynga Beat Amazon Cloud Failure - InformationWeek

11/30/2011
04:38 PM

How Netflix, Zynga Beat Amazon Cloud Failure

When Amazon Web Services crashed in April, Netflix and Zynga kept operating because they designed their systems to accommodate that possibility.

Architects for Netflix and the popular social networking game provider Zynga said they were able to cope with Amazon Web Services' outage over the Easter weekend and meet other challenges in using the public cloud without disrupting their businesses.

Both Netflix and Zynga depend on Amazon's public infrastructure as a service for a critical part of their operations. Netflix uses the Elastic Compute Cloud to convert analog copies of old films into digital content that can be streamed over the Internet to customers. Zynga launches a game on Amazon, plots its demand curve, and brings it in-house only when it's clear it won't lose subscribers by failing to meet early demand. The auto-scaling feature of Amazon enables both companies to cope with large fluctuations in demand.

When Amazon experienced what it termed a "re-mirroring storm" in April, data access and instance launch services were tied up in its Northern Virginia data center that serves as a primary site for Netflix operations. Adrian Cockcroft, director of cloud systems architecture for Netflix, said his firm had architected its systems in EC2 for failure and spread them across three availability zones--the equivalent of three separate data centers in Amazon parlance.

But the Easter weekend outage was so pervasive that services that failed in one availability zone tied up services in other zones, according to some companies, such as Mashery, a San Francisco firm that manages high-traffic APIs for other companies.

Randy Bias, CTO of Cloudscaling, a builder of clouds for service providers, moderated an appearance by Cockcroft at CloudBeat on Wednesday. CloudBeat is a two-day conference on cloud computing in Redwood City, Calif., sponsored by the venture capital-oriented website, VentureBeat. After the Easter weekend outage, Netflix said its operations were unaffected. "What was the actual impact on Netflix?" Bias asked at CloudBeat, and Cockcroft elaborated.

"Everyone [in Netflix IT] was complaining. We were seeing a higher error rate, which we did something about," he recounted. When Amazon customers such as Netflix tried to launch virtual servers or retrieve data on April 22, the requests failed because so much of Amazon's compute capacity was tied up coping with the re-mirroring storm. Netflix's existing servers continued running, and by the second day Netflix decided to move its infrastructure out of the worst-affected zone into the two that retained more of their operational capacity.

"One zone was primarily affected. We vacated the affected zone. We had to do that manually," Cockcroft recalled.

"Amazon doesn't do that for you?" Bias asked.

"No," Cockcroft answered. But Netflix had understood the risk going into the public cloud. "Everything can go away at any time. ... Our backend systems are highly redundant across three availability zones. Think of an availability zone as a separate data center that's one more millisecond in latency away," he said.
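
The manual step Cockcroft describes, vacating one zone and leaning on the other two, can be illustrated with a small sketch. This is not Netflix's tooling and it makes no calls to any Amazon API; the function name evacuate_zone, the instance names, and the zone labels are all hypothetical, and the "least-loaded healthy zone" rule is an assumption chosen only to keep the example simple.

```python
def evacuate_zone(placement, affected_zone, zones):
    """Return a new placement with every instance moved out of affected_zone.

    placement maps a (hypothetical) instance name to a zone name, and zones
    lists every zone in use. Each displaced instance is reassigned to the
    healthy zone that currently holds the fewest instances.
    """
    healthy = sorted(z for z in zones if z != affected_zone)
    if not healthy:
        raise ValueError("no healthy zone left to move into")

    load = {z: 0 for z in healthy}
    result = {}

    # Instances already in a healthy zone stay where they are.
    for name, zone in placement.items():
        if zone != affected_zone:
            load[zone] = load.get(zone, 0) + 1
            result[name] = zone

    # Displaced instances are spread across the healthy zones.
    for name in sorted(n for n, z in placement.items() if z == affected_zone):
        target = min(healthy, key=lambda z: (load[z], z))
        result[name] = target
        load[target] += 1

    return result


# Illustrative data only: two services, each replicated across three zones.
placement = {
    "api-1": "zone-a", "api-2": "zone-b", "api-3": "zone-c",
    "db-1": "zone-a", "db-2": "zone-b", "db-3": "zone-c",
}
after = evacuate_zone(placement, "zone-a", ["zone-a", "zone-b", "zone-c"])
```

Because each service already had replicas in the two healthy zones, the move only shifts spare capacity. Nothing has to come back from the failed zone for the system to keep serving.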

Cockcroft said Netflix is confident enough of public cloud infrastructure that it is now conducting its movie digitization and downloading movies to customers from inside Amazon, with core corporate systems still running in Netflix data centers. It stopped building new data centers in 2009 and began a shift into Amazon infrastructure in 2010 to cope with a rapid build-up of its subscriber base.

Pursuing a strategy in which the cloud plays an increasingly important role alongside the company's own data center is like riding two horses side by side, "with a foot in one stirrup for each one," Cockcroft said.

Adam Selipsky, VP of marketing at Amazon Web Services, preceded Cockcroft to the stage at CloudBeat and commented, "Netflix is wonderfully architected. They purposely blow away parts of a system to make sure they don't have points of failure. They came through without downtime." Netflix has introduced the concept of the Chaos Monkey to its developers and told them to build redundancy into their systems. When the Chaos Monkey is unleashed, selected parts of a system are disrupted to see whether it can continue operating as a whole.
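
The idea behind the Chaos Monkey, as described here, can be sketched in a few lines: disable a randomly selected part of a system, then check whether the whole still works. The code below is a toy simulation under stated assumptions, not Netflix's implementation. The names unleash_chaos_monkey and system_survives, the service layout, and the rule that a system survives if every service keeps at least one running replica are all hypothetical.

```python
import random


def unleash_chaos_monkey(services, rng):
    """Disable one randomly chosen replica and report which one.

    services maps a service name to the set of its running replicas.
    The chosen replica is removed in place to simulate a failure.
    """
    candidates = [
        (service, replica)
        for service, replicas in sorted(services.items())
        for replica in sorted(replicas)
    ]
    if not candidates:
        raise ValueError("nothing left to disrupt")
    service, replica = rng.choice(candidates)
    services[service].remove(replica)
    return service, replica


def system_survives(services):
    """Assumed health rule: every service still has a running replica."""
    return all(len(replicas) > 0 for replicas in services.values())


rng = random.Random(0)  # seeded only so repeated runs behave the same

# A redundant layout tolerates the loss of any single replica.
redundant = {"api": {"api-1", "api-2"}, "db": {"db-1", "db-2"}}
unleash_chaos_monkey(redundant, rng)
redundant_ok = system_survives(redundant)

# A layout with a single point of failure does not.
fragile = {"api": {"api-1"}}
unleash_chaos_monkey(fragile, rng)
fragile_ok = system_survives(fragile)
```

Running the disruption routinely, rather than waiting for a real outage, is what exposes a layout like the second one before it matters.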
