IT Infrastructure

Amazon EC2 Outage Hobbles Websites

Engine Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit were among the websites that suffered from slowed or disabled access.

Thomas Claburn, Editor at Large, Enterprise Mobility

April 21, 2011

Amazon Web Services' Elastic Compute Cloud, which offers computation as a service to thousands of businesses, and its Relational Database Service began experiencing errors shortly before 2 a.m. PDT on Thursday at Amazon's US-EAST data center in Virginia. The service interruption has been ongoing for more than nine hours.

The technical problems have slowed or disabled access to the websites of customers utilizing AWS US-East resources, including Engine Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit, to name a few.

Shortly before noon PDT on Thursday, Reddit displayed a notice saying the discussion site "is in 'emergency read-only mode' right now because Amazon is experiencing a degradation. They are working on it but we are still waiting for them to get to our volumes."

Hootsuite, Foursquare, and Quora displayed similar messages, while Heroku was inaccessible.

Amazon did not respond to a request for comment.

Engine Yard, a Ruby on Rails cloud service provider, was affected by the AWS outage, but Mike Piech, VP product management and marketing, said in an interview that the company weathered the storm because its business revolves around adding value to Amazon's cloud. As a service provider itself, the company has been working to limit the impact of a possible outage on clients by utilizing multiple Amazon data centers.

Engine Yard has been running EC2 instances exclusively out of Amazon's US-EAST facility, but the company has been beta testing multi-region availability to mitigate the risk of an outage. The goal is to host EC2 instances out of AWS facilities on the West Coast, in Europe, and Asia. As a result of the outage, Engine Yard accelerated its availability in other regions to help affected clients.

Piech insisted that hardware problems happen and the incident has not affected his company's interest in working with AWS.

The outage lit up the AWS customer support forum. An individual posting under the name "elephantdrive," which also is the name of a cloud storage service running atop AWS, echoed the frustration expressed by many other forum users that communication about the outage has been inadequate.

"We certainly understand that no operational infrastructure will be immune from downtime," said the person posting under the name elephantdrive. "We just want some estimate as to when the issue will be resolved. The Health page describes a problem and steps to resolution, but provides no estimates. We need some information to try to make business decisions."

Indeed, the sentiment expressed by many AWS customers is that the issue isn't so much about downtime, which happens, as it is about inadequate communication about the downtime.

Yet not everyone was so sanguine about the cloud. Jimmy Tam, general manager of Peer Software, a data backup and enterprise collaboration company, argued in an interview that outsourcing IT infrastructure to cloud service providers isn't the right choice for a lot of enterprise customers.

He cited global network performance as a major issue. "The cloud can be good for offices that have great bandwidth, but a lot of areas in the world don't have that," he said.

Tam pointed to one of his company's clients, a swimwear company that creates its designs in Los Angeles and runs its production in China. Getting design files uploaded and downloaded can take hours, he said, owing to the large file sizes and poor network bandwidth. "The cloud doesn't have sophisticated design software," he said. "You design on the desktop."

Outages like the one experienced by AWS present problems too. "If the file is local, I'm not worried about lost Internet connectivity," he said. "If you have an outage, that means everybody who is connected to the cloud can't have access." And he also insisted that data loss remains a possibility.

And he pointed to the risk that cloud service providers may choose to discontinue certain services, as Iron Mountain recently did. That leaves IT teams scrambling to come up with alternatives. "The cloud in theory is great," he said. "But I don't think any cloud provider has solved all of these issues."

As of 10:35 a.m. PDT, Amazon finally had some good news to share. "We are making progress on restoring access and IO latencies for affected RDS instances," the company said. "We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance--currently those requests are not being processed."

However, the outage looks as if it will trigger service credit under Amazon's 99.95% Service Level Agreement. With 8,760 hours in a year, AWS can be inaccessible for 4.38 hours annually under that agreement.
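The downtime figure above follows from simple arithmetic, which can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the calculation assumes a full 8,760-hour year measured against a flat availability percentage. It does not model how Amazon's agreement actually measures unavailability or calculates credits.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, the figure cited above


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in a period at a given availability level."""
    return period_hours * (100 - availability_percent) / 100


def exceeds_downtime_budget(outage_hours, availability_percent):
    """True if a single outage is longer than the whole annual allowance."""
    return outage_hours > allowed_downtime_hours(availability_percent)


if __name__ == "__main__":
    budget = allowed_downtime_hours(99.95)
    print(f"Annual allowance at 99.95%: {budget:.2f} hours")
    # The outage had already run for more than nine hours at the time of writing.
    print("Nine-hour outage exceeds allowance:", exceeds_downtime_budget(9, 99.95))
```

By this measure, an outage of nine hours is roughly double the annual allowance implied by a 99.95% commitment.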

AWS's S3 service suffered an eight-hour failure back in July 2008. At the time, the company said that "any downtime is unacceptable and we won't be satisfied until [AWS] is perfect."

About the Author

Thomas Claburn

Editor at Large, Enterprise Mobility

Thomas Claburn has been writing about business and technology since 1996, for publications such as New Architect, PC Computing, InformationWeek, Salon, Wired, and Ziff Davis Smart Business. Before that, he worked in film and television, having earned a not particularly useful master's degree in film production. He wrote the original treatment for 3DO's Killing Time, a short story that appeared in On Spec, and the screenplay for an independent film called The Hanged Man, which he would later direct. He's the author of a science fiction novel, Reflecting Fires, and a sadly neglected blog, Lot 49. His iPhone game, Blocfall, is available through the iTunes App Store. His wife is a talented jazz singer; he does not sing, which is for the best.
