9/22/2015 09:31 AM
Amazon Disruption Produces Cloud Outage Spiral

Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing.


Amazon Web Services had a rare instance of a cascading set of disruptions that rippled through a core set of services early Sunday, September 20. The failure appears to have begun in DynamoDB and spread from there.

The incident fell well short of the disruptions and outright outages that afflicted Amazon over the Easter weekend of 2011. Far fewer customers appear to have been affected, and the slowdowns or stalls were mainly limited to a data-oriented set of services. Nevertheless, the episode illustrates how, in the automated cloud, one service is linked to another in ways that the average customer may have a hard time predicting.

A number of Web companies, including Airbnb, IMDb, Pocket, Netflix, Tinder, and Buffer, were affected by the service slowdown and, in some cases, service disruption, according to comments that appeared on social media and on Fortune's website. The incident began at 3 a.m. PT Sunday, or 6 a.m. local time in the location where it had the greatest impact: Amazon's most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.

[Want to learn more about one solution to potential Amazon outages? See Amazon Outage: Multiple Zones A Smart Strategy.]

Just as much of Facebook runs on the NoSQL system Cassandra, much of Amazon depends on the NoSQL data store it invented for its own operations, DynamoDB. AWS identified the DynamoDB metadata service as the incident's cause within an hour of its start. At 4:52 a.m. PT, about two hours after the incident began, the AWS Service Health Dashboard reported: "The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information."

The metadata service controls the names of tables and partitions, the attributes of each table's primary key, and the table's read and write requirements, among other things. Such metadata is critical to how a NoSQL system functions, and, as it turned out, to how services that depend on DynamoDB behave when something goes awry.
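Amazon's internal metadata sub-service is not exposed to customers, but the public DescribeTable API returns the same kinds of table-level metadata the dashboard notice refers to. The sketch below, using Python and boto3, is illustrative only; the table name "Orders" is a hypothetical placeholder, and AWS credentials are assumed to be configured.

```python
# Illustrative sketch: inspect the table-level metadata DynamoDB tracks.
# Assumes configured AWS credentials; "Orders" is a hypothetical table name.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

table = dynamodb.describe_table(TableName="Orders")["Table"]

print(table["TableName"])                # table name
print(table["KeySchema"])                # primary key (hash/range) attributes
print(table["AttributeDefinitions"])     # attribute names and types
print(table["ProvisionedThroughput"])    # provisioned read/write capacity
```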

A total of 35 services, many of them in US-East-1, were affected by the DynamoDB outage. Twelve of them were only slowed or temporarily delayed; ten warranted the red "some customers may experience an outage" symbol on the Service Health Dashboard; and thirteen received the yellow symbol warning of a persistent slowdown. AWS lists a total of 117 services.

(Image: kirstypargeter/iStockphoto)

The services that carried a red warning in Northern Virginia, in addition to DynamoDB, included: Amazon Simple Email Service, Amazon WorkSpaces, Simple Queue Service, Lambda, Amazon CloudFormation, Simple Workflow Service, Simple Notification Service, Amazon CloudWatch, and Auto Scaling.

In its 4:52 a.m. PT notice, Amazon informed its customers it would need to stifle, or "throttle," the activity of service APIs in order to work on DynamoDB's recovery. Prior to that, as early as 3:12 a.m. PT, Amazon's core EC2 service in Ashburn began to experience greater latencies and error rates. The loss of DynamoDB had a swift impact on the more general infrastructure. At 3:28 a.m. PT, alarms indicated that CloudWatch, AWS's monitoring service, was slowing down inside US-East-1.
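From the customer side, API throttling typically surfaces as throttling errors that a client should absorb with retries and exponential backoff. The sketch below, in Python with boto3, shows one common defensive pattern under assumed names (the "Orders" table and its key are hypothetical); production SDKs also apply their own built-in retry logic.

```python
# Illustrative sketch: retry a DynamoDB read with exponential backoff when
# the API is being throttled. Table and key names are hypothetical.
import time
import boto3
from botocore.exceptions import ClientError

RETRYABLE = {"ProvisionedThroughputExceededException", "ThrottlingException"}

def get_item_with_backoff(table_name, key, max_attempts=5):
    client = boto3.client("dynamodb", region_name="us-east-1")
    for attempt in range(max_attempts):
        try:
            return client.get_item(TableName=table_name, Key=key)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) * 0.1)  # 0.1s, 0.2s, 0.4s, ...

item = get_item_with_backoff("Orders", {"OrderId": {"S": "12345"}})
```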

Likewise, by 3:51 a.m. PT there were increased error rates both for invoking Spot Instance APIs (the low-priced, demand-sensitive virtual servers offered by Amazon) and for launching new EC2 instances. At no time did EC2 services cease functioning; it was simply a matter of increased latencies and delays in their operations.

At 4:23 a.m. PT, AWS's Service Health Dashboard warned CloudWatch users: "Customers may experience delayed or missing data points for CloudWatch metrics and CloudWatch alarms may transition into 'INSUFFICIENT_DATA' state if set on delayed metrics."
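In practice, an operations team can check which of its alarms have dropped into that state. A minimal sketch, using Python and boto3 and assuming the US-East-1 region, is shown below; it is an illustration, not AWS's prescribed procedure.

```python
# Illustrative sketch: list CloudWatch alarms currently in INSUFFICIENT_DATA,
# for example because their underlying metric data points are delayed.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

paginator = cloudwatch.get_paginator("describe_alarms")
for page in paginator.paginate(StateValue="INSUFFICIENT_DATA"):
    for alarm in page["MetricAlarms"]:
        print(alarm["AlarmName"], "-", alarm["StateReason"])
```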

Shortly before 5 a.m. PT, the troubleshooting team reported through the dashboard that it was struggling to resolve the increased error rates for Elastic Block Store (EBS) APIs, which manage the persistent block storage volumes attached to running instances. Nevertheless, Amazon reported that customers remained connected to their instances, and the performance of EBS volumes was normal for customers who could still reach them through the EBS API.

"We are experiencing increased error rates accessing virtual private cloud endpoints for S3 (Simple Storage Service)," the dashboard reported at 6:45 a.m. PT.

By 8:19 a.m. PT, the troubleshooting team reported that the issues affecting EC2 had been resolved and the service was once again performing normally. The 5.3 hours of disruption early on a Sunday morning might be considered a less serious incident than one occurring at the start of business on the East Coast. But the interdependency of services, and how many of them depend on DynamoDB, is a lesson cloud users should fully consider.

Amazon recommends (and best practices dictate) keeping mission-critical systems operating in two availability zones. But AWS incident reporting is by region, not availability zone. It would be hard to know whether a multi-zone strategy would have been sufficient in this most recent incident.
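As an illustration of the multi-zone practice (not a claim about whether it would have helped in this incident), the sketch below uses Python and boto3 to create an Auto Scaling group spread across two availability zones in US-East-1. The group name, launch configuration, and zone names are hypothetical placeholders.

```python
# Illustrative sketch: spread an Auto Scaling group across two availability
# zones so a single-zone problem does not take out every instance.
# Assumes a launch configuration named "web-tier-lc" already exists.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",
    LaunchConfigurationName="web-tier-lc",
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```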

Netflix, which said it felt the incident but was not much affected by it, can fail over to an alternative Amazon region, such as US-West (Northern California), when US-East is heavily impacted by service failures. It does so by maintaining duplicates of its systems in a second or third regional data center (and paying the fees required to keep them stored there), or, in the case of a hot standby, keeping them running and ready to take over. Doing so is estimated to add 25% to the cost of a cloud system.
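One simplified way to implement that kind of regional failover is DNS-based: Route 53 failover records point traffic at a primary region and shift it to a standby region when a health check on the primary fails. The sketch below, in Python with boto3, is a hedged illustration, not Netflix's actual mechanism; the hosted zone ID, domain, IP addresses, and health check ID are hypothetical.

```python
# Illustrative sketch: Route 53 failover routing between a primary region
# (US-East-1) and a standby region. All identifiers below are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(set_id, role, ip, health_check_id=None):
    record = {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                    # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="Z3EXAMPLEZONE",
    ChangeBatch={"Changes": [
        failover_record("us-east-1", "PRIMARY", "203.0.113.10", "hc-primary-id"),
        failover_record("us-west-1", "SECONDARY", "203.0.113.20"),
    ]},
)
```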

AWS spokespeople were not available to comment further on Monday, Sept. 21.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
Charlie Babcock (Author), 10/2/2015 | 6:14:11 PM
Amazon's public statement on the disruption
Amazon produced a public statement on the outage, but I did not receive it in time to include in this story. It was: "Between 2:13 AM and 7:10 AM PDT on September 20, 2015, Amazon Web Services (AWS) experienced significant error rates with read and write operations for the Amazon DynamoDB service in the US-East Region, which impacted some other AWS services in that region, and caused some AWS customers to experience elevated error rates."

 
Charlie Babcock (Author), 9/22/2015 | 6:05:54 PM
Cloud processes designed to protect your system may eat it alive
"Many of us are customers of Amazon without knowing it." Good point, Tom. The cloud providers still don't know how to get a circuit breaker into an automated process gone awry. Amazon's Easter weekend shutdown four years ago was triggered by human error, when an operations person unplugged a trunk network, then replugged it into a backup network. That made all the data sets visible on that trunk line disappear, which triggered a "remirroring storm" as the cloud software tried to make up for the lost data by creating new sets. While that was going on, everything else pretty much ground to a halt--for 2-3 days.
Thomas Claburn (Author), 9/22/2015 | 4:55:22 PM
Re: Cloud shutdown as an automated process
One company affected by the AWS issues was Scout Alarm. This apparently limited the ability of customers to arm and disarm their alarm systems. As we become more dependent on cloud services, let's hope Amazon improves reliability. Many of us are AWS customers without knowing it.
Charlie Babcock (Author), 9/22/2015 | 4:34:44 PM
Cloud shutdown as an automated process
On Feb. 29, 2012, an expiration-date calculation for security certificates that didn't recognize a leap year caused one virtual machine after another to fail upon its attempted start. Three failed starts in a row caused a host in Microsoft's Azure cloud to conclude that hardware was failing, when it wasn't, and move the faulty virtual machine to another server, where the error could repeat itself. In this manner, cloud software can bring down the cloud through an unexpected automated process.