Amazon Cloud Outage Proves Importance Of Failover Planning - InformationWeek

4/27/2011
10:13 PM

Contingency plans kept Bizo and Mashery up and running during the Amazon service outage, offering lessons to other cloud-based businesses.

Flood watched the problems develop and knew he had to make a decision. His small firm consists mainly of eight engineers, none of them full-time systems operators. At 4 a.m. in Denver, Bizo traffic starts to grow on the East Coast as early risers check business publications and websites for the latest news. Bizo still had servers running in U.S. East-1, but Flood could see from the terse AWS information posts that he was unlikely to be able to launch more, which he would need at the start of the day to support a pending spike.

"It was not a decision I wanted to make," he says, "but Bizo supports thousands of websites," collecting data on the users visiting them and reporting to their owners what the traffic is doing that day. By 4:30 a.m., Flood was in touch with a Bizo partner, Dynect, which can direct or redirect Bizo traffic from one location to another through the Internet's Domain Name System. There was a 7.5-minute pause in Bizo's ability to service its traffic while Dynect technicians made the reconfiguration that told the DNS to redirect traffic from U.S. East-1 to Northern California.

"We decided at the start of the business day to funnel all our traffic to the West Coast" and avoid Amazon's problems. In doing so, Bizo maintained its service. Its ability to do so on the spur of the moment was based on a close relationship with Dynect that worked in the middle of the night. It was something that Amazon itself couldn't do, Flood realized.

If it hadn't done so, "we would have been stuck with the number of instances currently running," a number set by the low traffic of early morning hours, Flood said. Bizo has multiple update services to support. Once morning traffic built, "We'd have been stuck. We wouldn't have been able to spike up," he said.
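The capacity problem Flood describes can be put in rough numbers. The sketch below is illustrative only: the function names and every figure in it are hypothetical, not Bizo's actual instance counts or traffic. It shows how a fleet sized for overnight traffic falls short once no new instances can be launched.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float) -> int:
    """Smallest number of instances able to serve peak_rps requests per second."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps / rps_per_instance)


def shortfall(running: int, peak_rps: float, rps_per_instance: float) -> int:
    """Instances still missing at peak if no new ones can be launched."""
    return max(0, instances_needed(peak_rps, rps_per_instance) - running)


if __name__ == "__main__":
    # Hypothetical figures: 4 instances left running overnight, a morning
    # peak of 5,000 requests/s, and 500 requests/s handled per instance.
    print(shortfall(running=4, peak_rps=5000, rps_per_instance=500))
```

A nonzero shortfall in the affected region is the signal that traffic has to move to a region where new instances can still be started.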

Mashery, even more than Bizo, had a plan in place for an Amazon outage. The San Francisco firm helps engineer and monitor the APIs that tie a company service, such as the Netflix film download service, to its customers. CEO Michels said in an interview that the firm is responsible for the continued monitoring and operation of APIs for 25,000 running applications. The service is subscribed to by such customers as the New York Times, Hoover's and Best Buy, as well as Netflix.

"In our first year, we took the assumption that everything (in the cloud) is going to fail," and set up failover paths to an outside data center service supplier, InterNAP, Michels said.

"We architected so that everything could run, even if 'home' is suddenly not available," he said in an interview. The failover service was set up through a third-party DNS routing service, UltraDNS, and tested in advance. When the outage came, Mashery was ready and its monitoring and reporting traffic was rerouted from U.S. East-1 to the systems waiting at InterNAP. The failover functioned as expected.
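The kind of priority-based failover described here, tested in advance, can be sketched in a few lines. This is a minimal illustration, not Mashery's or UltraDNS's actual mechanism: the hostnames, the `choose_target` function, and the simulated health check are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Set


def choose_target(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all fail."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


def simulated_check(down: Set[str]) -> Callable[[str], bool]:
    """Build a fake health check that reports the given endpoints as down."""
    return lambda endpoint: endpoint not in down


# Hypothetical hostnames: the primary cloud region first, the outside
# data center second.
PRIORITY = ["primary.example.com", "backup-dc.example.com"]

if __name__ == "__main__":
    # Normal operation: traffic stays at the primary.
    print(choose_target(PRIORITY, simulated_check(set())))
    # Failover drill: the primary is marked down, so traffic goes to the backup.
    print(choose_target(PRIORITY, simulated_check({"primary.example.com"})))
```

Injecting the health check as a parameter is what makes the drill possible: the failover path can be exercised on demand without waiting for a real outage.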

"We were never unreachable," said Michels. The failover created a slowdown or pause in the Mashery operations that may have lasted up to two minutes, he said, "but that isn't like two days."

A note posted to the Amazon Services Health Dashboard April 24 said the three-day service outage would be fully explained in "a detailed post mortem." On April 27, AWS CTO Werner Vogels posted to his blog a 2010 letter that Amazon CEO Jeff Bezos wrote to shareholders, extolling AWS' technology innovation and commitment to customers. The post mortem is still pending.
