How Netflix, Zynga Beat Amazon Cloud Failure
When Amazon Web Services crashed in April, Netflix and Zynga kept operating because they designed their systems to accommodate that possibility.
Architects for Netflix and the popular social networking game provider Zynga said they were able to cope with Amazon Web Services' outage over the Easter weekend and meet other challenges in using the public cloud without disrupting their businesses.
Both Netflix and Zynga depend on Amazon's public infrastructure as a service for a critical part of their operations. Netflix uses the Elastic Compute Cloud to convert analog copies of old films into digital content that can be streamed over the Internet to customers. Zynga launches a game on Amazon, plots its demand curve, and brings it in-house only when it's clear it won't lose subscribers by failing to meet early demand. The auto-scaling feature of Amazon enables both companies to cope with large fluctuations in demand.
When Amazon experienced what it termed a "re-mirroring storm" in April, data access and instance launch services were tied up in its Northern Virginia data center that serves as a primary site for Netflix operations. Adrian Cockcroft, director of cloud systems architecture for Netflix, said his firm had architected its systems in EC2 for failure and spread them across three availability zones--the equivalent of three separate data centers in Amazon parlance.
But the Easter weekend outage was so pervasive that services that failed in one availability zone tied up services in other zones, according to some companies, such as Mashery, a San Francisco firm that manages high-traffic APIs for other companies.
Randy Bias, CTO of Cloudscaling, a builder of clouds for service providers, moderated an appearance by Cockcroft at CloudBeat on Wednesday. CloudBeat is a two-day conference on cloud computing in Redwood City, Calif., sponsored by the venture capital-oriented website VentureBeat. After the Easter weekend outage, Netflix said its operations were unaffected. "What was the actual impact on Netflix?" Bias asked at CloudBeat, and Cockcroft elaborated.
"Everyone [in Netflix IT] was complaining. We were seeing a higher error rate, which we did something about," he recounted. As Amazon customers such as Netflix attempted to launch virtual servers or retrieve data on April 22, the launches would fail because Amazon's compute resources were so heavily tied up coping with the re-mirroring storm. Netflix's existing servers continued running, and by the second day, Netflix decided to move its infrastructure out of the worst-affected zone into the two that retained more of their operational capacity.
"One zone was primarily affected. We vacated the affected zone. We had to do that manually," Cockcroft recalled.
"Amazon doesn't do that for you?" Bias asked.
"No," Cockcroft answered. But Netflix had understood the risk going into the public cloud. "Everything can go away at any time. ... Our backend systems are highly redundant across three availability zones. Think of an availability zone as a separate data center that's one more millisecond in latency away," he said.
Cockcroft said Netflix is confident enough of public cloud infrastructure that it is now conducting its movie digitization and downloading movies to customers from inside Amazon, with core corporate systems still running in Netflix data centers. It stopped building new data centers in 2009 and began a shift into Amazon infrastructure in 2010 to cope with a rapid build-up of its subscriber base.
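The manual step Cockcroft described, vacating one availability zone and running on the two healthier ones, can be pictured with a small sketch. This is an illustration only, not Netflix's tooling: the instance names, zone labels, and the evacuate_zone helper are all hypothetical, and a real evacuation would involve launching and draining servers rather than editing a dictionary.

```python
from collections import Counter


def evacuate_zone(placement, bad_zone):
    """Return a new placement with every instance moved out of bad_zone.

    placement maps a (hypothetical) instance name to its zone label.
    Each displaced instance goes to whichever remaining zone currently
    holds the fewest instances, so the load stays roughly balanced.
    """
    healthy = sorted({z for z in placement.values() if z != bad_zone})
    if not healthy:
        raise ValueError("no healthy zone left to absorb the load")

    # Count what the surviving zones already carry.
    load = Counter(z for z in placement.values() if z != bad_zone)

    new_placement = {}
    for instance, zone in sorted(placement.items()):
        if zone == bad_zone:
            # Pick the least-loaded zone; ties break on zone name.
            target = min(healthy, key=lambda z: (load[z], z))
            load[target] += 1
            new_placement[instance] = target
        else:
            new_placement[instance] = zone
    return new_placement


# Hypothetical fleet spread evenly across three zones.
placement = {
    "api-1": "zone-a", "api-2": "zone-b", "api-3": "zone-c",
    "api-4": "zone-a", "api-5": "zone-b", "api-6": "zone-c",
}
after = evacuate_zone(placement, "zone-a")
print(Counter(after.values()))
```

The sketch only works because the fleet was already spread over several zones before the failure, which is the design decision Cockcroft credits for Netflix staying up.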
Pursuing a strategy in which the cloud plays an increasingly important role alongside a company's own data center is like riding two horses side by side, "with a foot in one stirrup for each one," Cockcroft said.
Adam Selipsky, VP of marketing at Amazon Web Services, preceded Cockcroft to the stage at CloudBeat and commented, "Netflix is wonderfully architected. They purposely blow away parts of a system to make sure they don't have points of failure. They came through without downtime." Netflix has introduced the concept of the Chaos Monkey to its developers and told them to build redundancy into their systems. When the Chaos Monkey is unleashed, selected parts of a system are disrupted to see whether it can continue operating as a whole.
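The Chaos Monkey idea can be illustrated with a toy simulation: deliberately knock out randomly chosen instances, then check whether the service as a whole is still up. The sketch below is a simplified model under stated assumptions, not Netflix's actual tool; the fleet layout, the function names, and the rule that "up" means at least one healthy instance are all invented for illustration.

```python
import random


def make_fleet(zones, per_zone):
    """Build a hypothetical fleet with per_zone instances in each zone."""
    return [
        {"name": f"{zone}-{i}", "zone": zone, "healthy": True}
        for zone in zones
        for i in range(per_zone)
    ]


def unleash_chaos(instances, kills, rng):
    """Mark up to `kills` randomly chosen healthy instances as failed.

    Returns the names of the instances that were taken down.
    """
    healthy = [inst for inst in instances if inst["healthy"]]
    victims = rng.sample(healthy, min(kills, len(healthy)))
    for inst in victims:
        inst["healthy"] = False
    return [inst["name"] for inst in victims]


def service_is_up(instances):
    """Assumed rule: the service is up if any instance is still healthy."""
    return any(inst["healthy"] for inst in instances)


rng = random.Random(42)  # seeded so repeated runs pick the same victims

# A redundant fleet survives losing two instances.
fleet = make_fleet(["zone-a", "zone-b", "zone-c"], per_zone=2)
killed = unleash_chaos(fleet, kills=2, rng=rng)
survived = service_is_up(fleet)

# A single-instance "fleet" is a single point of failure.
lonely = make_fleet(["zone-a"], per_zone=1)
unleash_chaos(lonely, kills=1, rng=rng)
lonely_survived = service_is_up(lonely)

print(killed, survived, lonely_survived)
```

Running the same disruption against both fleets shows the point of the exercise: the test exposes a single point of failure before a real outage does.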
"You can architect for failure. Individual servers, storage arrays can fail and you can still stay up," said Selipsky. When asked what lessons Amazon had learned from the Easter outage, he said it had introduced more separation of systems to prevent a failure of one service from interfering with others.
But he stubbornly maintained that the service tie-up in Northern Virginia was not as great an incident as has been reported and was not an EC2 cloud outage. "The incidents you're talking about were fairly contained. We had five regions worldwide. One availability zone in that one region was affected," he said.
Then Selipsky added: "We've taken steps to separate systems, to decouple things that were coupled to each other. We've repaired software glitches and taken other steps to make sure it doesn't happen again."
Another speaker, Allan Leinwand, CTO of infrastructure engineering at Zynga, said firms must understand how to fit the Amazon infrastructure into what they want to do. In Zynga's case, it has built a private cloud--the Z cloud--that is similar to EC2, and it can move the operation of its games between the two. Zynga launched CityVille in the Amazon cloud in November 2010, allowed it to pick up steam, then as growth slowed, brought it back inside to its Z cloud. Its most recent offering, CastleVille, was launched the same way.
"We love the public cloud. Amazon has done an exceptional job," said Leinwand. "But Amazon is a four-door sedan. I love four-door sedans. I drive one. But maybe your application needs a fast sports car or a Winnebago or an 18-wheeler. In the Amazon cloud, a four-door is what you get."
The key, Leinwand continued, is understanding the needs of your application. Zynga has games that may grow slowly for several weeks after launch, then reach a critical mass that causes them to add millions of users in a short time. The scalable Amazon infrastructure is good for hosting that expansion.
But Zynga can gear its own infrastructure to share services across games and employ other cost savings in the private cloud that can't be matched by a single game running in the public cloud. "I think the public cloud is something I would absolutely leverage," he advised. "But after you've established your application in the cloud, think about how you can change the op-ex into cap-ex"--build out your own infrastructure in a way that becomes the most efficient way to run the application, he said.
Amazon's Selipsky was put on the spot when asked whether Oracle CEO Larry Ellison was right when he charged that Salesforce.com's applications are unsafe because they run on Salesforce's multi-tenant infrastructure. Was that the case, asked Matt Marshall, editor in chief of VentureBeat, or was that Oracle FUD?
Amazon has numerous Oracle offerings in its product catalogue, such as the Oracle database and Oracle applications, but they too run in Amazon's multi-tenant architecture. Selipsky paused, then said: "Oracle is a great part of our offerings. ... All of the Oracle suite runs on AWS. I'll make that positive statement."