Amazon Cloud Outage Proves Importance Of Failover Planning
Contingency plans kept Bizo and Mashery up and running during the Amazon service outage, offering lessons to other cloud-based businesses.
Flood watched the problems develop and knew he had to make a decision. His small firm consists mainly of eight engineers, none of them full-time systems operators. At 4 a.m. in Denver, Bizo traffic starts to grow on the East Coast as early risers check business publications and websites for the latest news. Bizo still had servers running in U.S. East-1, but Flood could see from the terse AWS information posts that he was unlikely to be able to launch more, which he would need at the start of the day to support a pending spike.
"It was not a decision I wanted to make," he says, "but Bizo supports thousands of websites," collecting data on the users visiting them and reporting to their owners what the traffic is doing that day. By 4:30 a.m., Flood was in touch with a Bizo partner, Dynect, which can direct or redirect Bizo traffic from one location to another through the Internet's Domain Name System. There was a 7.5-minute pause in Bizo's ability to service its traffic while Dynect technicians did the reconfiguration that told the DNS to redirect traffic from U.S. East-1 to Northern California.
"We decided at the start of the business day to funnel all our traffic to the West Coast" and avoid Amazon's problems, Flood said. In doing so, Bizo maintained its service. Its ability to do so on the spur of the moment depended on a close relationship with Dynect, which was able to respond in the middle of the night. It was something that Amazon itself couldn't do, Flood realized.
If Bizo hadn't rerouted its traffic, "we [would] have been stuck with the number of instances currently running," a number set by the low traffic of early morning hours, Flood said. Bizo has multiple update services to support. Once morning traffic built, "We'd have been stuck. We wouldn't have been able to spike up," he said.
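Flood's concern comes down to simple arithmetic: a fleet sized for overnight traffic cannot absorb a morning peak if no new servers can be launched. The Python sketch below illustrates that reasoning only. The function names and every number in it are hypothetical and are not Bizo's actual figures.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float) -> int:
    """Smallest whole number of instances that can serve the peak load."""
    return math.ceil(peak_rps / rps_per_instance)


def shortfall(running: int, peak_rps: float, rps_per_instance: float) -> int:
    """How many instances short the fleet is if nothing new can be launched."""
    return max(0, instances_needed(peak_rps, rps_per_instance) - running)


# Hypothetical figures: 4 instances left running overnight, each able to
# handle 200 requests per second, facing a morning peak of 1,800 per second.
print(shortfall(running=4, peak_rps=1800, rps_per_instance=200))  # prints 5
```

With these made-up figures, the fleet would be five instances short at the peak, which is the position Flood describes as being unable to "spike up."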
Mashery, even more than Bizo, had a plan in place for an Amazon outage. The San Francisco firm helps engineer and monitor the APIs that tie a company service, such as the Netflix film download service, to its customers. CEO Michels said in an interview that the firm is responsible for the continued monitoring and operation of APIs for 25,000 running applications. Its customers include the New York Times, Hoover's and Best Buy, as well as Netflix.
"In our first year, we took the assumption that everything (in the cloud) is going to fail," and set up failover paths to an outside data center service supplier, InterNAP, Michels said.
"We architected so that everything could run, even if 'home' is suddenly not available," he said. The failover service was set up through a third-party DNS routing service, UltraDNS, and tested in advance. When the outage came, Mashery was ready, and its monitoring and reporting traffic was rerouted from U.S. East-1 to the systems waiting at InterNAP. The failover functioned as expected.
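The general pattern behind this kind of DNS-based failover can be sketched in a few lines of Python: keep a priority-ordered list of sites, check each one's health, and point the DNS record at the first site that responds. This is a minimal illustration under stated assumptions, not Mashery's or UltraDNS's actual implementation. The `Site` class, the `pick_active_site` function, the site names and the addresses (taken from the reserved documentation ranges) are all hypothetical, and the health check is simulated rather than performed over the network.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Site:
    name: str     # human-readable label for the location
    address: str  # IP address the DNS record would point at


def pick_active_site(sites: Sequence[Site],
                     is_healthy: Callable[[Site], bool]) -> Site:
    """Return the first healthy site, checking in priority order."""
    for site in sites:
        if is_healthy(site):
            return site
    raise RuntimeError("no healthy site available")


# Hypothetical sites; the addresses are reserved documentation ranges.
primary = Site("us-east-1", "192.0.2.10")
standby = Site("standby-datacenter", "198.51.100.20")

# Simulate the outage: the primary region fails its health check.
down = {"us-east-1"}
active = pick_active_site([primary, standby], lambda s: s.name not in down)
print(active.name)  # prints standby-datacenter
```

In a real deployment the health check would probe the service itself, and the chosen address would be pushed to the DNS provider. Rehearsing the switch in advance, as Mashery did, is what confirms the standby can actually take the traffic.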
"We were never unreachable," said Michels. The failover created a slowdown or pause in the Mashery operations that may have lasted up to two minutes, he said, "but that isn't like two days."
A note posted to the Amazon Services Health Dashboard April 24 said the three-day service outage will be fully explained in "a detailed post mortem." On April 27, AWS CTO Werner Vogels posted to his blog a 2010 letter that Amazon CEO Jeff Bezos wrote to shareholders, extolling AWS' technology innovation and commitment to customers. The post mortem is still pending.