Cloud Takes A Hit: Amazon Must Fix EC2 - InformationWeek

Cloud // Infrastructure as a Service
Commentary
4/22/2011
03:32 PM
Charles Babcock

Cloud Takes A Hit: Amazon Must Fix EC2

Amazon's "availability zones" were a key protective concept for the cloud, but they failed to protect access to data when EC2 went down.

It seems to me the outage of Amazon’s cloud computing service yesterday was a signal event. IT advocates of cloud computing face severe internal skepticism that the cloud is a reliable, distributed environment. In the past, they’ve responded that skilled service providers, such as Amazon, architect against failure with availability zones, independently running sections in one data center. If you run your application in one and keep a mirror image in another, you’re protected. Some enterprises found out yesterday the architecture doesn’t work. Their critics had a field day.

Amazon’s outage in Northern Virginia yesterday impeded customer access to data beyond one availability zone in that center. Amazon has a West Coast data center as well as one in Northern Virginia, but something that wasn’t clear before became clear yesterday. Amazon zones don’t extend to different data centers in different geographic locations. This fact is reverberating today among users of cloud computing. The different availability zones are supposed to keep services running, even if part of the data center fails. They didn’t function as advertised.

Amazon Web Services has been posting its usual terse explanations to its Service Health Dashboard, but for the anxious IT manager they don't say much. They don't say, for example, when the cause of the trouble can be expected to be alleviated. Service troubles started at 5 minutes before 1 a.m. Pacific time on Thursday. At 11:09 a.m., the dashboard acknowledged many customers were asking when service would be back: "We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate." Their best guess: "in a few hours."

Let's be clear on what did and did not happen. Amazon's EC2 infrastructure as a service, the compute servers, stayed up and running in Northern Virginia, but some of them lost the ability to access data, launch a customer's stored instances, and save results of running instances. That means those customer servers or “instances” that were running time-sensitive applications or customer-facing apps were rendered useless.

On the other hand, some customers may not have been affected at all. CloudSleuth, an EC2 monitoring service from Compuware that's meant to illustrate the capabilities of its Gomez monitoring service, had two test applications running in Northern Virginia Thursday, and they responded to pings, indicating that they had stayed up and running through the outage. Neither of the test apps was making use of Relational Database Service or Elastic Block Store, key affected services. If they had needed them, they would have stalled.

A disruption to the RDS appears to have led to interruptions of the EBS storage service that Amazon offers customers to capture data and record the application instance. The failure of these services in a zone of what's known as US-East 1, an Amazon data center in Northern Virginia, was bad enough, but their failure in turn triggered RDS and EBS service disruptions in additional availability zones.

Most enterprise applications in EC2 would be making use of EBS and some would use RDS as well. Their inability to access data would render them useless in many cases for the length of the service disruption. Until Amazon can demonstrate that it knows what caused the problem and how to fix it, this disruption puts a stake in the heart of the argument that Amazon zones are adequate protection against failure.

That's because Amazon itself presents the zones as the chief protection against your application failing. "By launching instances in separate Availability Zones, you can protect your applications from failure of a single location," states the guidance for users of Amazon Machine Images.

What is a zone? Only Amazon knows for sure. I know the new New York Stock Exchange data center in Mahwah, N.J., designed for high availability, was built on the border of two utility companies, giving it two sources of power. To me, a cloud data center has at least two zones with distinct electricity sources. One can fail, and the rest of the facility keeps running. Likewise, with telecommunication carriers, two or more are necessary. Zones within the data center tap into different services; they're architected against both failing at the same time. Yesterday's outage, on the contrary, says zones are not insulated from one another and a service failure of one can spill over into another. This is a body blow to cloud computing.
