"A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1," the Service Health Watch dashboard said seven hours after the trouble began. "A networking event" could be anything, from the failure of a core switch in the data center to the activity of a backhoe outside it.
What this means to me is that a large amount of EBS storage suddenly became unavailable for unexplained reasons, and the Amazon cloud did what it was supposed to do: backup copies existed elsewhere, so the system began reading from them and replicating the lost data onto fresh storage. It's an article of faith among cloud proponents that keeping three copies of data on different storage elements protects against unrecoverable loss. When one copy disappears, one of the remaining two becomes the go-to copy, and the system starts replicating a new third copy on different storage.
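To make that replication scheme concrete, here is a minimal sketch of the three-copy idea in Python. The names (`Volume`, `lose_node`, `re_mirror`) are purely illustrative assumptions, not Amazon's actual internals:

```python
# Hypothetical sketch of three-copy replication with re-mirroring.
# All class and method names are illustrative, not EBS's real design.

class Volume:
    def __init__(self, name, nodes):
        # Each replica of the volume lives on a distinct storage node.
        self.name = name
        self.replicas = set(nodes)

    def lose_node(self, node):
        # A "networking event" makes one replica unreachable.
        self.replicas.discard(node)

    def re_mirror(self, spare_nodes, target=3):
        # Restore the replica count by copying data to fresh nodes.
        while len(self.replicas) < target and spare_nodes:
            self.replicas.add(spare_nodes.pop())

vol = Volume("vol-1", ["node-a", "node-b", "node-c"])
vol.lose_node("node-a")        # one copy disappears ...
vol.re_mirror(["node-d"])      # ... and a third copy is rebuilt elsewhere
assert len(vol.replicas) == 3
```

The point of the sketch is the last step: every lost replica implies a full copy of the volume's data being written somewhere new, which is where the "large amount of re-mirroring" comes from.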
With the limited information available, yesterday's event appears to have triggered such a large data restructuring effort that EBS and RDS services in other zones were disrupted as well, not for a few minutes but for many hours. This is not equipment failure, but the architecture shooting itself in the foot.
Consider the following postings from the EC2 health dashboard:
2:18 a.m. Pacific: "We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones ..."
4:09 a.m. Pacific: "EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone."
6:59 a.m. Pacific: "There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region."
My interpretation: Something triggered demand for storage volumes so great that a running application couldn't get one when it needed it. If a server instance wasn't available in one zone, its backup might not be accessible in another. For some users, an application that was running and needed its data probably couldn't reach it.
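That interpretation can be sketched as a toy capacity model: if a burst of re-mirroring drains the pool of spare storage ahead of ordinary requests, CreateVolume calls start failing. This is an assumption about the failure mode, not Amazon's confirmed mechanics, and `run_storm` is a hypothetical name:

```python
# Illustrative model of a re-mirroring storm starving CreateVolume.
# Hypothetical sketch only; not based on EBS internals.

def run_storm(spare_capacity, re_mirror_requests, user_requests):
    errors = 0
    # Re-mirroring grabs spare volumes first, ahead of user requests.
    spare_capacity -= min(spare_capacity, re_mirror_requests)
    for _ in range(user_requests):
        if spare_capacity > 0:
            spare_capacity -= 1
        else:
            errors += 1  # CreateVolume returns an error to the caller
    return errors

# Normally, 100 user requests succeed against 1,000 spare volumes ...
assert run_storm(1000, 0, 100) == 0
# ... but a mass re-mirroring event consumes nearly all of them.
assert run_storm(1000, 950, 100) == 50
```

In this toy model the storage itself is healthy; the outage for users is a side effect of the recovery mechanism consuming the capacity they would otherwise draw on.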
Amazon may dispute this view, particularly the claim that the architecture itself was to blame for the delays and service losses some customers experienced. Whatever the cause, Amazon needs to come up with both a succinct explanation of it and assurance that it's been addressed. Many skeptics will be listening.
Cloud computing suffered a setback, a loss of confidence, in this incident. Not until 12 hours after the initial notice were customers assured they could use availability zones other than the one where the outage originated, and that zone was still recovering 36 hours after the incident began. It will take a remedy and a long run of good customer experience before enterprise IT is convinced the cloud can be made reliable. A commenter on Slashdot.org, JBrodkin, noted Thursday: "The availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of availability zones."
Amazon's last post on EC2 service said its troubleshooters are "working this hard and will do so as long as it takes." I hope the cloud architects are chained to their desks as well and come up with a better design, one that prevents a crippling surge in traffic when a data-protection feature kicks in and runs amok.
Charles Babcock is an editor-at-large for InformationWeek.