Amazon Web Services suffered a rare cascading set of disruptions that rippled through a core set of services early Sunday, September 20. The failure appears to have begun in DynamoDB and spread from there.
The incident fell well short of the disruptions and outright outages that afflicted Amazon over the Easter weekend of 2011. Far fewer customers seem to have been affected, and slowdowns or stalls appear to have been limited mainly to a data-oriented set of services. Nevertheless, the episode illustrates how, in the automated cloud, one service is linked to another in ways that the average customer may have a hard time predicting.
A number of Web companies, including Airbnb, IMDb, Pocket, Netflix, Tinder, and Buffer, were affected by the service slowdown and, in some cases, service disruption, according to comments that appeared on social media and on Fortune's website. The incident began at 3 a.m. PT Sunday, or 6 a.m. in the location where it had the greatest impact: Amazon's most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.
Just as much of Facebook runs on the NoSQL system Cassandra, much of Amazon depends on the unstructured data system it invented for its own operations, DynamoDB. AWS identified the DynamoDB metadata service as the incident's cause within an hour of its start. At 4:52 a.m. PT, about two hours after the incident began, the AWS Service Health Dashboard reported: "The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information."
The metadata service controls the names of tables and partitions, the attributes of the table's primary key, and the table's read-write requirements, among other things. Such metadata is critical to how a NoSQL system functions and, as it turns out, to how services that depend on DynamoDB behave when something goes awry.
A total of 35 services, many of them in US-East-1, were affected by the DynamoDB outage. Twelve of them were only slowed or temporarily delayed; ten warranted the red "some customers may experience an outage" symbol on the Service Health Dashboard; and thirteen received the yellow symbol warning of a persistent slowdown. AWS lists a total of 117 services.
The services that carried a red warning in Northern Virginia, in addition to DynamoDB, included: Amazon Simple Email Service, Amazon WorkSpaces, Simple Queue Service, Lambda, AWS CloudFormation, Simple Workflow Service, Simple Notification Service, Amazon CloudWatch, and Auto Scaling.
In its 4:52 a.m. PT notice, Amazon informed its customers it would need to stifle or "throttle" the activity of service APIs in order to work on DynamoDB's recovery. Prior to that, as early as 3:12 a.m. PT, Amazon's core EC2 service in Ashburn began to experience greater latencies and error rates. The loss of DynamoDB had a swift impact on the more general infrastructure. At 3:28 a.m. PT alarms went off that CloudWatch, AWS's monitoring service, was slowing down inside US-East-1.
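For readers unfamiliar with the term, throttling generally means capping how many API calls are accepted per unit of time. The sketch below shows one common technique, a token bucket. It is only an illustration of the general idea, not a description of how AWS throttled its APIs during the incident; the class name, the rate, and the capacity are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Illustrative token-bucket throttle (hypothetical, not AWS's mechanism).

    Allows `rate` calls per second on average, with bursts of up to
    `capacity` calls.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable clock, handy for testing
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a call may proceed now, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=3)
    # The first three calls drain the initial burst allowance; the
    # fourth is rejected until the bucket has had time to refill.
    for i in range(4):
        print(i, bucket.allow())
```

A caller that receives a rejection would typically wait and retry, which is why throttled APIs show up to customers as slowdowns and elevated error rates rather than as a hard outage.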
Likewise, by 3:51 a.m. PT there were increased error rates for the APIs used to request Spot Instances (the low-priced, demand-sensitive virtual servers offered by Amazon) and to launch new EC2 instances. At no time did EC2 cease functioning; the problem was one of increased latencies and delays in its operations.
At 4:23 a.m. PT, AWS's Service Health Dashboard warned CloudWatch users: "Customers may experience delayed or missing data points for CloudWatch metrics and CloudWatch alarms may transition into 'INSUFFICIENT_DATA' state if set on delayed metrics."
Shortly before 5 a.m. PT, the troubleshooting team reported through the dashboard that it was struggling to resolve the increased error rates for Elastic Block Store (EBS) APIs, which manage block storage volumes for running instances. Nevertheless, Amazon reported that customers remained connected to their instances and that the performance of EBS volumes was normal, provided customers could reach them through the EBS API.
"We are experiencing increased error rates accessing virtual private cloud endpoints for S3 (Simple Storage Service)," the dashboard reported at 6:45 a.m. PT.
By 8:19 a.m. PT, the troubleshooting team reported that the issues affecting EC2 had been resolved and service was once again performing normally. The 5.3 hours of disruption early on a Sunday morning might be considered less serious than an incident occurring at the start of business on the East Coast. But the interdependency of services, and the dependency of so many of them on DynamoDB, is a lesson that cloud users should weigh carefully.
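The 5.3-hour figure follows from the two timestamps reported in this article: the 3 a.m. PT start and the 8:19 a.m. PT all-clear for EC2. A minimal check of that arithmetic, using a hypothetical helper named hours_between:

```python
from datetime import datetime


def hours_between(start_hhmm, end_hhmm):
    """Return the hours elapsed between two same-day clock times ("HH:MM")."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return delta.total_seconds() / 3600


if __name__ == "__main__":
    # 3:00 a.m. to 8:19 a.m. is 5 hours 19 minutes.
    print(round(hours_between("03:00", "08:19"), 1))  # 5.3
```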
Amazon recommends (and best practices dictate) keeping mission-critical systems operating in two availability zones. But AWS incident reporting is by region, not availability zone. It would be hard to know whether a multi-zone strategy would have been sufficient in this most recent incident.
Netflix, which said it felt the incident but was not much affected by it, can fail over to an alternative Amazon region, such as US-West-Northern California, when US-East is heavily impacted by service failures. It can do so by maintaining duplicates of its systems in a second or third regional data center (and paying the fees required to keep them stored there), or in cases of hot standby, keeping them running and ready to take over. It's estimated that doing this adds 25% to the cost of a cloud system.
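As a rough sketch of what regional failover and its cost premium look like in practice, the example below picks the first healthy region from an ordered preference list and applies the 25% overhead estimate cited above. It does not describe Netflix's actual tooling: the function names are hypothetical, the health check is a stand-in supplied by the caller, and the monthly cost is an invented figure.

```python
def pick_region(preferred_regions, is_healthy):
    """Return the first region that passes the health check, or None.

    `is_healthy` is a caller-supplied function (hypothetical here); a real
    one might probe an endpoint or consult monitoring data.
    """
    for region in preferred_regions:
        if is_healthy(region):
            return region
    return None


def cost_with_standby(base_cost, overhead=0.25):
    """Apply the article's estimated 25% premium for a second region."""
    return base_cost * (1 + overhead)


if __name__ == "__main__":
    regions = ["us-east-1", "us-west-1"]  # primary first, then standby

    # Simulate the incident: the primary region fails its health check.
    def degraded(region):
        return region != "us-east-1"

    print(pick_region(regions, degraded))  # us-west-1
    print(cost_with_standby(1000.0))       # 1250.0
```

The ordering of the list encodes the policy: traffic stays in the primary region whenever it is healthy and moves to the standby only when it is not.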
AWS spokespeople were not available to comment further on Monday, Sept. 21.
Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...