Both Netflix and Zynga depend on Amazon's public infrastructure as a service for a critical part of their operations. Netflix uses the Elastic Compute Cloud to convert analog copies of old films into digital content that can be streamed over the Internet to customers. Zynga launches a game on Amazon, plots its demand curve, and brings it in-house only when it's clear it won't lose subscribers by failing to meet early demand. The auto-scaling feature of Amazon enables both companies to cope with large fluctuations in demand.
When Amazon experienced what it termed a "re-mirroring storm" in April, data access and instance launch services were tied up in its Northern Virginia data center that serves as a primary site for Netflix operations. Adrian Cockcroft, director of cloud systems architecture for Netflix, said his firm had architected its systems in EC2 for failure and spread them across three availability zones--the equivalent of three separate data centers in Amazon parlance.
But the Easter weekend outage was so pervasive that services that failed in one availability zone tied up services in other zones, according to some companies, such as Mashery, a San Francisco firm that manages high-traffic APIs for other companies.
Randy Bias, CTO of Cloudscaling, a builder of clouds for service providers, moderated an appearance by Cockcroft at CloudBeat on Wednesday. CloudBeat is a two-day conference on cloud computing in Redwood City, Calif., sponsored by the venture capital-oriented website, VentureBeat. After the Easter weekend outage, Netflix said its operations were unaffected. "What was the actual impact on Netflix?" Bias asked at CloudBeat, and Cockcroft elaborated.
"Everyone [in Netflix IT] was complaining. We were seeing a higher error rate, which we did something about," he recounted. As Amazon customers such as Netflix attempted to launch virtual servers or retrieve data on April 22, the launches would fail due to the extreme degree to which Amazon's compute resources were tied up coping with the remirroring storm. Netflix existing servers continued running, and by the second day, Netflix decided to move its infrastructure out of the worst affected zone into the two that retained more of their operational capacity.
"One zone was primarily affect. We vacated the affected zone. We had to do that manually," Cockcroft recalled.
"Amazon doesn't do that for you?" Bias asked.
"No," Cockcroft answered. But Netflix had understood the risk going into the public cloud. "Everything can go away at any time. ... Our backend systems are highly redundant across three availability zones. Think of an availability zone as a separate data center that's one more millisecond in latency away," he said.
Cockcroft said Netflix is confident enough of public cloud infrastructure that it is now conducting its movie digitization and downloading movies to customers from inside Amazon, with core corporate systems still running in Netflix data centers. It stopped building new data centers in 2009 and began a shift into Amazon infrastructure in 2010 to cope with a rapid build-up of its subscriber base.
Pursuing a strategy in which the cloud plays an increasingly important role alongside a company's own data center is like riding two horses side by side, "with a foot in one stirrup for each one," Cockcroft said.
Adam Selipsky, VP of marketing at Amazon Web Services, preceded Cockcroft to the stage at CloudBeat and commented, "Netflix is wonderfully architected. They purposely blow away parts of a system to make sure they don't have points of failure. They came through without downtime." Netflix has introduced the concept of the Chaos Monkey to its developers and told them to build redundancy into their systems. When the Chaos Monkey is unleashed, selected parts of a system are disrupted to see whether it can continue operating as a whole.
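The Chaos Monkey idea can be illustrated with a toy simulation: remove one randomly chosen instance and check whether enough capacity survives. This is a hedged sketch under stated assumptions, not Netflix's tool; `run_chaos_experiment`, the instance names, and the `min_healthy` threshold are hypothetical, and nothing here touches real servers.

```python
import random


def run_chaos_experiment(instances_by_zone, min_healthy, rng=None):
    """Simulate disrupting one random instance and report whether the service survives.

    instances_by_zone maps a zone label to a list of instance names (hypothetical).
    Returns (victim, survivors, still_ok).
    """
    rng = rng or random.Random()
    # Flatten to (zone, instance) pairs so every instance is equally likely to be picked.
    candidates = [
        (zone, inst)
        for zone, insts in instances_by_zone.items()
        for inst in insts
    ]
    if not candidates:
        raise ValueError("no instances to disrupt")
    victim_zone, victim = rng.choice(candidates)
    # Build the surviving fleet without mutating the caller's data.
    survivors = {
        zone: [i for i in insts if not (zone == victim_zone and i == victim)]
        for zone, insts in instances_by_zone.items()
    }
    healthy = sum(len(insts) for insts in survivors.values())
    return victim, survivors, healthy >= min_healthy


if __name__ == "__main__":
    fleet = {"zone-a": ["a1", "a2"], "zone-b": ["b1", "b2"], "zone-c": ["c1", "c2"]}
    victim, survivors, ok = run_chaos_experiment(fleet, min_healthy=4)
    print(f"disrupted {victim}; service still healthy: {ok}")
```

In a real system the health check would be an end-to-end probe of the service rather than a simple instance count; the count stands in for it here to keep the example self-contained.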