Amazon's "availability zones" were a key protective concept for the cloud, but they failed to protect access to data when EC2 went down.
It seems to me the outage of Amazon’s cloud computing service yesterday was a signal event. IT advocates of cloud computing face severe internal skepticism that the cloud is a reliable, distributed environment. In the past, they’ve responded that skilled service providers, such as Amazon, architect against failure with availability zones, independently running sections in one data center. If you run your application in one and keep a mirror image in another, you’re protected. Some enterprises found out yesterday the architecture doesn’t work. Their critics had a field day.
Amazon’s outage in Northern Virginia yesterday impeded customer access to data beyond one availability zone in that center. Amazon has a West Coast data center as well as one in Northern Virginia, but something that wasn’t clear before became clear yesterday. Amazon zones don’t extend to different data centers in different geographic locations. This fact is reverberating today among users of cloud computing. The different availability zones are supposed to keep services running, even if part of the data center fails. They didn’t function as advertised.
Amazon Web Services has been posting its usual terse explanations to its Service Health Dashboard, but for the anxious IT manager they don't say much. They don't say, for example, when the trouble can be expected to end. Service troubles started at 12:55 a.m. Pacific time on Thursday. At 11:09 a.m., the dashboard acknowledged many customers were asking when service would be back: "We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate." Their best guess: "in a few hours."
Let's be clear on what did and did not happen. Amazon's EC2 infrastructure as a service, the compute servers, stayed up and running in Northern Virginia, but some of them lost the ability to access data, launch a customer's stored instances, and save results of running instances. That means those customer servers or “instances” that were running time-sensitive applications or customer-facing apps were rendered useless.
On the other hand, some customers may not have been affected at all. CloudSleuth, an EC2 monitoring service from Compuware that's meant to illustrate the capabilities of its Gomez monitoring service, had two test applications running in Northern Virginia Thursday, and they responded to pings, indicating that they had stayed up and running through the outage. Neither of the test apps was making use of Relational Database Service or Elastic Block Store, the key affected services. If they had needed them, they would have stalled.
A disruption to RDS appears to have led to interruptions of the EBS storage service that Amazon offers customers to capture data and record the application instance. The failure of these services in a zone of what's known as US-East 1, an Amazon data center in Northern Virginia, was bad enough, but their failure in turn triggered RDS and EBS service disruptions in additional availability zones.
Most enterprise applications in EC2 would be making use of EBS and some would use RDS as well. Their inability to access data would render them useless in many cases for the length of the service disruption. Until Amazon can demonstrate that it knows what caused the problem and how to fix it, this disruption puts a stake in the heart of the argument that Amazon zones are adequate protection against failure.
That's because Amazon itself presents the zones as the chief protection against your application failing. "By launching instances in separate Availability Zones, you can protect your applications from failure of a single location," states the guidance for users of Amazon Machine Images.
What is a zone? Only Amazon knows for sure. I know the new New York Stock Exchange data center in Mahwah, N.J., designed for high availability, was built on the border of two utility companies, giving it two sources of power. To me, a cloud data center has at least two zones with distinct electricity sources. One can fail, and the rest of the facility keeps running. Likewise, with telecommunication carriers, two or more are necessary. Zones within the data center tap into different services; they're architected against both failing at the same time. Yesterday's outage, by contrast, says zones are not insulated from one another and a service failure in one can spill over into another. This is a body blow to cloud computing.