Amazon Cloud Outage Proves Importance Of Failover Planning
Contingency plans kept Bizo and Mashery up and running during the Amazon service outage, offering lessons to other cloud-based businesses.
In the aftermath of the Amazon cloud service outage last week, two San Francisco businesses that depend on Amazon's EC2, Bizo and Mashery, say it's possible to survive such a mishap without business disruption.
But both companies had taken steps to protect themselves. Bizo relied on a practice that left many observers wondering why Amazon itself hadn't adopted it: the ability to shift a system running in one data center to another in a separate geographic location.
Amazon recommends that a customer running a workload on a server instance in one availability zone of a data center keep a carbon copy, perhaps running at the same time, in another zone. Amazon has never precisely defined an availability zone, but the zones are distinct operating sections within a data center. Each zone is believed to have power and telecommunications services separate from those of the other zones.
The best protection against an outage, according to Amazon guidance, is to establish a mirrored instance, running the same logic and data as the original. But doing so adds to the cost of cloud computing. You're paying for two server instances instead of one. You must also pay by the gigabyte to move data from one availability zone to another.
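The cost trade-off described above can be put into rough arithmetic. The sketch below is illustrative only: the function name, its parameters, and the rates in the usage line are hypothetical placeholders, not Amazon's actual prices.

```python
def mirrored_monthly_cost(instance_hourly_rate, hours_per_month,
                          replicated_gb, transfer_rate_per_gb):
    """Compare one instance against a mirrored pair in two zones.

    All inputs are hypothetical; substitute the rates on your own bill.
    Returns (single_instance_cost, mirrored_cost) for one month.
    """
    # Cost of running a single instance for the month.
    single = instance_hourly_rate * hours_per_month
    # A mirror doubles the instance charge and adds a per-gigabyte
    # charge for data copied between availability zones.
    mirrored = 2 * single + replicated_gb * transfer_rate_per_gb
    return single, mirrored


# Made-up rates: $0.10 per hour, 720 hours, 500 GB copied at $0.01 per GB.
single, mirrored = mirrored_monthly_cost(0.10, 720, 500, 0.01)
print(f"single: ${single:.2f}  mirrored: ${mirrored:.2f}")
```

Under these made-up numbers the mirrored setup costs a little more than twice the single instance, which is the premium a customer pays for protection against a zone failure.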
Those who incur these charges believe they have protected themselves against an outage in their primary zone. But in the early morning hours of April 21, the Amazon Elastic Block Store (EBS) and Relational Database Service (RDS) began to fail in one availability zone of Amazon's Northern Virginia U.S. East-1 data center, then faltered and began to fail in the three others as well.
Oren Michels, CEO of Mashery, and Donnie Flood, VP of engineering at Bizo, know all about that set of failures. They had taken Amazon's recommended steps, but fortunately they were also able to go beyond them.
Flood said Bizo's Web-based business marketing platform uses both U.S. East-1 and Amazon's second North American data center in Northern California. As a matter of fact, Bizo uses two availability zones in each center to protect against an outage.
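A layout like the one Flood describes, two regions with two availability zones in each, lends itself to a simple failover rule: stay in the preferred region while any of its zones is healthy, and move to the other region only when all of them fail. The sketch below is a minimal illustration under that assumption, not Bizo's actual implementation. The `Deployment` type, the `pick_deployment` function, and the zone labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    region: str    # e.g. "us-east-1" (Northern Virginia)
    zone: str      # illustrative zone label, e.g. "us-east-1a"
    healthy: bool  # result of some health check, assumed to exist


def pick_deployment(deployments: List[Deployment],
                    preferred_region: str) -> Optional[Deployment]:
    """Return a healthy deployment, preferring the given region."""
    healthy = [d for d in deployments if d.healthy]
    # First choice: any healthy zone inside the preferred region.
    for d in healthy:
        if d.region == preferred_region:
            return d
    # Otherwise fall back to a healthy zone in any other region,
    # or None if nothing at all is healthy.
    return healthy[0] if healthy else None


# Scenario resembling April 21: both eastern zones are impaired.
fleet = [
    Deployment("us-east-1", "us-east-1a", healthy=False),
    Deployment("us-east-1", "us-east-1b", healthy=False),
    Deployment("us-west-1", "us-west-1a", healthy=True),
    Deployment("us-west-1", "us-west-1b", healthy=True),
]
choice = pick_deployment(fleet, preferred_region="us-east-1")
print(choice.zone if choice else "no healthy deployment")
```

The point of the second region is visible in the fallback branch: zone-level mirroring alone would have returned nothing once trouble spread across every zone in U.S. East-1.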
On April 21, Flood was on a trip and asleep in Denver when his phone started issuing alerts around 2:30 a.m. Rocky Mountain Time. Thirty-five minutes earlier, the RDS and EBS services that power the Bizo applications in U.S. East-1 had started having problems and the AWS Service Health Dashboard was about to issue its first notice of something going awry.
At first, Flood couldn't believe that a single set of failures was serious, but the alerts continued to pour in with disturbing regularity. U.S. East-1 is an important data center to Bizo because the company hosts more traffic there than in Northern California. As best Flood could tell in the middle of the night, the problem that had started in one of the data center's availability zones was spreading and impairing Bizo's operations.
"U.S. East is our main region. I was surprised by the spread of trouble into the additional zones. That goes against what is expected," said Flood in an interview.