Amazon Cloud Outage Proves Importance Of Failover Planning
Contingency plans kept Bizo and Mashery up and running during the Amazon service outage, offering lessons to other cloud-based businesses.
In the aftermath of the Amazon cloud service outage last week, two San Francisco businesses that depend on Amazon's EC2, Bizo and Mashery, say it's possible to survive such a mishap without business disruption.
But in both cases, they had taken steps to protect their businesses. Bizo resorted to a practice that left many observers wondering why Amazon itself hadn't adopted it: shifting a system running in one data center to another in a separate geographic location.
Amazon's recommendation is for a customer running a workload on a server instance in one availability zone of a data center to keep a carbon copy of it, perhaps running at the same time, in another zone. Amazon has never precisely defined an availability zone, but zones are distinct operating sections within a data center, and each is believed to have power and telecommunications services separate from the others.
The best protection against an outage, according to Amazon guidance, is to establish a mirrored instance, running the same logic and data as the original. But doing so adds to the cost of cloud computing. You're paying for two server instances instead of one. You must also pay by the gigabyte to move data from one availability zone to another.
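For a concrete picture of that recommendation, the sketch below shows one way the mirroring pattern might be set up programmatically. It uses the modern boto3 Python SDK rather than the tooling available at the time of the outage, and the AMI ID, instance type, and zone names are placeholders, not anything Amazon or the companies in this story prescribed.

```python
# A minimal sketch of the mirrored-instance pattern, assuming the boto3 SDK and
# placeholder identifiers. Each copy of the workload is pinned to its own
# availability zone within the same region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_mirrored_pair(ami_id, instance_type="t3.micro",
                         primary_az="us-east-1a", mirror_az="us-east-1b"):
    """Launch the same machine image in two availability zones."""
    instance_ids = {}
    for role, zone in (("primary", primary_az), ("mirror", mirror_az)):
        response = ec2.run_instances(
            ImageId=ami_id,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},  # pin this copy to its zone
        )
        instance_ids[role] = response["Instances"][0]["InstanceId"]
    return instance_ids

# Hypothetical usage: launch_mirrored_pair("ami-0123456789abcdef0")
```

Running both copies doubles the instance bill, which is exactly the cost trade-off described above.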
Those who incur these charges believe they have protection in place in the event of an outage in their primary zone. But in the early morning hours of April 21, as the Amazon Elastic Block Store (EBS) and Relational Database Service (RDS) began to fail in one availability zone of Amazon's Northern Virginia U.S. East-1 data center, the services faltered and began to fail in the three other zones as well.
Oren Michels, CEO of Mashery, and Donnie Flood, VP of engineering at Bizo, know all about that set of failures. They had taken Amazon's recommended steps, but fortunately they were also able to go beyond them.
Flood said Bizo's Web-based business marketing platform uses both U.S. East-1 and Amazon's second North American data center in Northern California. As a matter of fact, Bizo uses two availability zones in each center to protect against an outage.
On April 21, Flood was on a trip and asleep in Denver when his phone started issuing alerts around 2:30 a.m. Rocky Mountain Time. Thirty-five minutes earlier, the RDS and EBS services that power the Bizo applications in U.S. East-1 had started having problems and the AWS Services Health Dashboard was about to issue its first notice of something going awry.
At first, Flood couldn't believe the set of failures was serious, but the alerts continued to pour in with disturbing regularity. U.S. East-1 is an important data center to Bizo because the company hosts more traffic there than in Northern California. As best as Flood could tell in the middle of the night, the problem that started in one of the data center's availability zones was spreading, impairing Bizo's operations.
"U.S. East is our main region. I was surprised by the spread of trouble into the additional zones. That goes against what is expected," said Flood in an interview. Flood watched the problems develop and knew he had to make a decision. His small firm consists mainly of eight engineers, none of them full-time systems operators. At 4 a.m. in Denver, Bizo traffic starts to grow on the East Coast as early risers check business publications and websites for the latest news. Bizo still had servers running in U.S. East-1, but Flood could see from the terse AWS information posts that he was unlikely to be able to launch more, which he would need at the start of the day to support a pending spike.
"It was not a decision I wanted to make," he says, "but Bizo supports thousands of websites," collecting data on the users visiting them and reporting to their owners what the traffic is doing that day. By 4:30 a.m., Flood was in touch with a Bizo partner, Dynect, which can direct or redirect Bizo traffic from one location to another through the Internet's Domain Name System. There was a 7.5 minute pause on Bizo's ability to service its traffic as Dynect technicians did the reconfiguration that told the DNS to redirect traffic from U.S. East-1 to Northern California.
"We decided at the start of the business day to funnel all our traffic to the West Coast" and avoid Amazon's problems. In doing so, Bizo maintained its service. Its ability to do so in the spur of the moment was based on a close relationship with Dynect that worked in the middle of the night. It was something that Amazon itself couldn't do, Flood realized.
If it hadn't done so, "we have been stuck with the number of instances currently running," a number set by the low traffic of early morning hours, Flood said. Bizo has multiple update services to support. Once morning traffic builds, "We'd have been stuck. We wouldn't have been able to spike up," he said.
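Bizo's redirection was carried out by Dynect technicians, but the underlying operation is a DNS record change that points the service's hostname at the surviving region. As a rough illustration only, the sketch below performs that kind of change with Amazon's Route 53 API; the hosted zone ID, record name, and West Coast target are placeholders, and this is not the Dynect mechanism Bizo actually used.

```python
# Illustrative only: a DNS-level failover expressed as a Route 53 record change.
# Bizo's change was made by its DNS provider, Dynect; the identifiers below are
# placeholders.
import boto3

route53 = boto3.client("route53")

def redirect_traffic_west(hosted_zone_id, record_name,
                          west_target="lb-us-west.example.com", ttl=60):
    """Repoint the service hostname from the East Coast to the West Coast endpoint."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Fail over from U.S. East-1 to Northern California",
            "Changes": [{
                "Action": "UPSERT",  # create or overwrite the existing record
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": west_target}],
                },
            }],
        },
    )
```

A short TTL on the record is what lets clients pick up the new destination within minutes rather than hours.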
Mashery, even more than Bizo, had a plan in place for what it would do in the event of an Amazon outage. The San Francisco firm helps engineer and monitor the APIs that tie a company's service, such as the Netflix film download service, to its customers. CEO Michels said in an interview that the firm is responsible for the continued monitoring and operation of APIs for 25,000 running applications. Customers that subscribe to the service include the New York Times, Hoover's, and Best Buy, as well as Netflix.
"In our first year, we took the assumption that everything (in the cloud) is going to fail," and set up failover paths to an outside data center service supplier, InterNAP, Michels said.
"We architected so that everything could run, even if 'home' is suddenly not available," he said in an interview. The failover service was set up through a third-party DNS routing service, UltraDNS, and tested in advance. When the outage came, Mashery was ready and its monitoring and reporting traffic was rerouted from U.S. East-1 to the systems waiting at InterNAP. The failover functioned as expected.
"We were never unreachable," said Michels. The failover created a slowdown or pause in the Mashery operations that may have lasted up to two minutes, he said, "but that isn't like two days."
A note posted to the Amazon Services Health Dashboard on April 24 said the three-day service outage would be fully explained in "a detailed post mortem." On April 27, AWS CTO Werner Vogels posted to his blog a 2010 letter that Amazon CEO Jeff Bezos wrote to shareholders, extolling AWS' technology innovation and commitment to customers. The post mortem is still pending.