Amazon Disruption Produces Cloud Outage Spiral - InformationWeek

Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing.


Amazon Web Services suffered a rare cascade of disruptions that rippled through a core set of services early Sunday, September 20. The failure appears to have begun in DynamoDB and spread from there.

The incident fell well short of the disruptions and outright outages that afflicted Amazon over the Easter weekend of 2011. Far fewer customers appear to have been affected, and slowdowns or stalls were mainly limited to a data-oriented set of services. Nevertheless, the episode illustrates how, in the automated cloud, one service is linked to another in ways that the average customer may have a hard time predicting.

A number of Web companies, including Airbnb, IMDb, Pocket, Netflix, Tinder, and Buffer, were affected by the slowdown and, in some cases, outright service disruption, according to comments that appeared on social media and on Fortune's website. The incident began at 3 a.m. PT Sunday, or 6 a.m. local time at the site where it had the greatest impact: Amazon's most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.

[Want to learn more about one solution to potential Amazon outages? See Amazon Outage: Multiple Zones A Smart Strategy.]

Just as much of Facebook runs on the NoSQL system Cassandra, much of Amazon depends on the unstructured data system it invented for its own operations, DynamoDB. AWS identified the DynamoDB metadata service as the incident's cause within an hour of its start. At 4:52 a.m. PT, about two hours after the incident began, the AWS Service Health Dashboard reported: "The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information."

The metadata service controls the names of tables and partitions, the attributes of each table's primary key, and the table's read-write requirements, among other things. Such metadata is critical to how a NoSQL system functions, and, as it turned out, to how services that depend on DynamoDB behave when something goes awry.
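To make the idea concrete, here is a minimal sketch of the kind of per-table metadata such a service tracks, modeled loosely on the shape of DynamoDB's DescribeTable response. The table name, attributes, and values are illustrative, not taken from the incident:

```python
# Sketch of per-table metadata: name, key schema, read-write capacity.
# All names and values are hypothetical, for illustration only.
table_metadata = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "CustomerId", "KeyType": "HASH"},   # partition key
        {"AttributeName": "OrderId", "KeyType": "RANGE"},     # sort key
    ],
    "ProvisionedThroughput": {  # the table's read-write requirements
        "ReadCapacityUnits": 100,
        "WriteCapacityUnits": 50,
    },
    "TableStatus": "ACTIVE",
}

def partition_key(meta):
    """Return the partition (HASH) key attribute name from table metadata."""
    for key in meta["KeySchema"]:
        if key["KeyType"] == "HASH":
            return key["AttributeName"]
    return None

print(partition_key(table_metadata))  # -> CustomerId
```

If this lookup layer becomes unavailable, clients cannot resolve which partition holds their data, which is why a metadata-service failure can stall every service built on top of the store.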

A total of 35 of AWS's 117 listed services, many of them in US-East-1, were affected by the DynamoDB outage. Twelve were only slowed or temporarily delayed; ten warranted the red, "some customers may experience an outage" symbol on the Service Health Dashboard; and thirteen received the yellow symbol warning of a persistent slowdown.

(Image: kirstypargeter/iStockphoto)

The services that carried a red warning in Northern Virginia, in addition to DynamoDB, included: Amazon Email Service, Amazon Workspaces, Simple Queue Service, Lambda, Amazon CloudFormation, Simple Workflow Service, Simple Notification Service, Amazon CloudWatch, and Auto Scaling.

In its 4:52 a.m. PT notice, Amazon informed its customers it would need to stifle or "throttle" the activity of service APIs in order to work on DynamoDB's recovery. Prior to that, as early as 3:12 a.m. PT, Amazon's core EC2 service in Ashburn began to experience greater latencies and error rates. The loss of DynamoDB had a swift impact on the more general infrastructure. At 3:28 a.m. PT alarms went off that CloudWatch, AWS's monitoring service, was slowing down inside US-East-1.
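Throttling pushes the recovery burden onto clients: a well-behaved client backs off and retries rather than hammering the constrained API. Below is a generic sketch of exponential backoff with jitter, the standard client-side response to throttling; it is not AWS code, and the `flaky` operation is a stand-in for a throttled API call:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a throttled operation, waiting roughly twice as long
    after each failure, with random jitter to spread out retries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Hypothetical operation that is "throttled" twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("ThrottlingException")
    return "ok"

print(call_with_backoff(flaky))  # -> ok
```

The jitter matters: if every client retried on the same schedule, the synchronized retries would themselves behave like the cascading load the throttling was meant to relieve.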

Likewise, by 3:51 a.m. PT there were increased error rates for invoking both Spot Instance APIs (the low-priced, demand-sensitive virtual servers offered by Amazon) and new EC2 instance launches. At no time did EC2 services cease functioning; customers simply saw increased latencies and outright delays in operations.

At 4:23 a.m. PT, AWS's Service Health Dashboard warned CloudWatch users: "Customers may experience delayed or missing data points for CloudWatch metrics and CloudWatch alarms may transition into 'INSUFFICIENT_DATA' state if set on delayed metrics."
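The INSUFFICIENT_DATA state exists because a monitoring alarm cannot judge a metric it has not received. A simplified illustration of that evaluation logic, with a hypothetical five-minute staleness limit (not CloudWatch's actual implementation):

```python
from datetime import datetime, timedelta, timezone

def alarm_state(last_datapoint_time, now, staleness_limit=timedelta(minutes=5)):
    """Illustrative alarm evaluation: if the metric has not reported a
    datapoint within the staleness limit, there is nothing recent to
    evaluate, so the alarm reports INSUFFICIENT_DATA instead of OK/ALARM."""
    if last_datapoint_time is None or now - last_datapoint_time > staleness_limit:
        return "INSUFFICIENT_DATA"
    return "OK"

now = datetime(2015, 9, 20, 11, 0, tzinfo=timezone.utc)
stale = now - timedelta(minutes=30)  # metric delayed, as during the incident
fresh = now - timedelta(minutes=1)   # metric reporting normally

print(alarm_state(stale, now))   # -> INSUFFICIENT_DATA
print(alarm_state(fresh, now))   # -> OK
```

This is why delayed metrics during the incident flipped alarms to INSUFFICIENT_DATA even on systems that were otherwise healthy.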

Shortly before 5 a.m. PT, the troubleshooting team reported through the dashboard that it was struggling to resolve the increased error rates for Elastic Block Store (EBS) APIs, which grant access to persistent block storage for running instances. Nevertheless, Amazon reported that customers remained connected to their instances, and the performance of EBS volumes already in use was normal; the elevated error rates affected the API calls used to manage them.

"We are experiencing increased error rates accessing virtual private cloud endpoints for S3 (Simple Storage Service)," the dashboard reported at 6:45 a.m. PT.

By 8:19 a.m. PT, the troubleshooting team reported that the issues affecting EC2 had been resolved and the service was again performing normally. The 5.3 hours of disruption early on a Sunday morning might be considered less serious than an incident occurring at the start of business on the East Coast. But the interdependency of services, and the number of them that depend on DynamoDB, is a lesson cloud users should fully consider.

Amazon recommends (and best practices dictate) keeping mission-critical systems operating in at least two availability zones. But AWS incident reporting is by region, not availability zone, so it is hard to know whether a multi-zone strategy would have been sufficient in this most recent incident.

Netflix, which said it felt the incident but was not much affected by it, can fail over to an alternative Amazon region, such as US-West (Northern California), when US-East is heavily impacted by service failures. It does so by maintaining duplicates of its systems in a second or third regional data center (and paying the fees required to keep them stored there), or, in the case of hot standby, keeping them running and ready to take over. Doing so is estimated to add 25% to the cost of a cloud system.
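The routing side of such a failover can be sketched in a few lines: prefer the primary region, and fall back to a standby when health checks fail. This is a hypothetical simplification, not Netflix's system; the region names and `health` map are illustrative:

```python
def pick_region(health, primary="us-east-1", standbys=("us-west-1",)):
    """Return the first healthy region, preferring the primary.
    `health` maps region name -> bool (illustrative health-check results)."""
    for region in (primary, *standbys):
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")

# During a US-East disruption, traffic fails over to the standby region.
print(pick_region({"us-east-1": False, "us-west-1": True}))  # -> us-west-1
```

The hard and expensive part is not this routing decision but what it presumes: that a current, tested copy of the whole system is already running, or ready to run, in the standby region.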

AWS spokespeople were not available to comment further on Monday, Sept. 21.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Charlie Babcock,
User Rank: Author
10/2/2015 | 6:14:11 PM
Amazon's public statement on the disruption
Amazon produced a public statement on the outage but I did not receive it in time to include in this story. It was: Between 2:13 AM and 7:10 AM PDT on September 20, 2015, Amazon Web Services (AWS) experienced significant error rates with read and write operations for the Amazon DynamoDB service in the US-East Region, which impacted some other AWS services in that region, and caused some AWS customers to experience elevated error rates.  

Charlie Babcock,
User Rank: Author
9/22/2015 | 6:05:54 PM
Cloud processes designed to protect your system may eat it alive
"Many of us are customers of Amazon without knowing it." Good point, Tom. The cloud providers still don't know how to get a circuit breaker into an automated process gone awry. Amazon's Easter weekend shutdown four years ago was triggered by human error, when an operations person unplugged a trunk network, then replugged it into a backup network. That made all the data sets visible on that trunk line disappear, which triggered a "remirroring storm" as the cloud software tried to make up for the lost data by creating new sets. While that was going on, everything else pretty much ground to a halt--for 2-3 days.
Thomas Claburn,
User Rank: Author
9/22/2015 | 4:55:22 PM
Re: Cloud shutdown as an automated process
One company affected by the AWS issues was Scout Alarm. This apparently limited the ability of customers to arm and disarm their alarm systems. As we become more dependent on cloud services, let's hope Amazon improves reliability. Many of us are AWS customers without knowing it.
Charlie Babcock,
User Rank: Author
9/22/2015 | 4:34:44 PM
Cloud shutdown as an automated process
On Feb. 29, 2012, an expiration date for security certificates that didn't recognize a leap year caused one virtual machine after another to fail upon its attempted start. Three failed starts in a row caused a host in Microsoft's Azure cloud to conclude that hardware was failing, when it wasn't, and move the faulty virtual machine to another server, where the error could repeat itself. In this manner does cloud software bring down the cloud as an unexpected automated process.