Engine Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit were among the websites that suffered from slowed or disabled access.
Amazon Web Services' Elastic Compute Cloud (EC2), which offers computation as a service to thousands of businesses, and its Relational Database Service (RDS) began experiencing errors shortly before 2 a.m. PDT on Thursday at Amazon's US-EAST data center in Virginia. The service interruption has been ongoing for more than nine hours.
The technical problems have slowed or disabled access to the websites of customers using AWS US-EAST resources, including Engine Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit.
Shortly before noon PDT on Thursday, Reddit displayed a notice saying the discussion site "is in 'emergency read-only mode' right now because Amazon is experiencing a degradation. They are working on it but we are still waiting for them to get to our volumes."
Hootsuite, Foursquare, and Quora displayed similar messages, while Heroku was inaccessible.
Amazon did not respond to a request for comment.
Engine Yard, a Ruby on Rails cloud service provider, was affected by the AWS outage, but Mike Piech, VP product management and marketing, said in an interview that the company weathered the storm because its business revolves around adding value to Amazon's cloud. As a service provider itself, the company has been working to limit the impact of a possible outage on clients by utilizing multiple Amazon data centers.
Engine Yard has been running EC2 instances exclusively out of Amazon's US-EAST facility, but the company has been beta testing multi-region availability to mitigate the risk of an outage. The goal is to host EC2 instances out of AWS facilities on the West Coast, in Europe, and Asia. As a result of the outage, Engine Yard accelerated its availability in other regions to help affected clients.
Piech insisted that hardware problems happen and the incident has not affected his company's interest in working with AWS.
The outage lit up the AWS customer support forum. An individual posting under the name "elephantdrive," which also is the name of a cloud storage service running atop AWS, echoed the frustration expressed by many other forum users that communication about the outage has been inadequate.
"We certainly understand that no operational infrastructure will be immune from downtime," said the person posting under the name elephantdrive. "We just want some estimate as to when the issue will be resolved. The Health page describes a problem and steps to resolution, but provides no estimates. We need some information to try to make business decisions."
Indeed, the sentiment expressed by many AWS customers is that the issue isn't so much about downtime, which happens, as it is about inadequate communication about the downtime.
Yet not everyone was so sanguine about the cloud. Jimmy Tam, general manager of Peer Software, a data backup and enterprise collaboration company, argued in an interview that outsourcing IT infrastructure to cloud service providers isn't the right choice for a lot of enterprise customers.
He cited global network performance as a major issue. "The cloud can be good for offices that have great bandwidth, but a lot of areas in the world don't have that," he said.
Tam pointed to one of his company's clients, a swimwear company that creates its designs in Los Angeles and runs its production in China. Getting design files uploaded and downloaded can take hours, he said, owing to the large file sizes and poor network bandwidth. "The cloud doesn't have sophisticated design software," he said. "You design on the desktop."
Outages like the one experienced by AWS present problems too. "If the file is local, I'm not worried about lost Internet connectivity," he said. "If you have an outage, that means everybody who is connected to the cloud can't have access." And he also insisted that data loss remains a possibility.
And he pointed to the risk that cloud service providers may choose to discontinue certain services, as Iron Mountain recently did. That leaves IT teams scrambling to come up with alternatives. "The cloud in theory is great," he said. "But I don't think any cloud provider has solved all of these issues."
As of 10:35 a.m. PDT, Amazon finally had some good news to share. "We are making progress on restoring access and IO latencies for affected RDS instances," the company said. "We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance--currently those requests are not being processed."
However, the outage looks as if it will trigger service credits under Amazon's 99.95% Service Level Agreement. With 8,760 hours in a year, a 99.95% availability commitment allows AWS to be inaccessible for 4.38 hours annually.
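The downtime figure follows directly from the availability percentage. As a rough sketch of that arithmetic only (the function name and the nine-hour outage value are illustrative assumptions, and the actual SLA terms for measuring unavailability and issuing credits are not described here):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an availability percentage (hypothetical helper)."""
    return period_hours * (1 - availability_pct / 100)


sla_budget = allowed_downtime_hours(99.95)

# Assumption: the outage is treated as a flat nine hours, based on the
# "more than nine hours" reported above; the real duration may differ.
outage_hours = 9

print(f"Annual downtime budget: {sla_budget:.2f} hours")
print(f"Outage exceeds annual budget: {outage_hours > sla_budget}")
```

Under these assumptions, a single outage of this length is roughly double the downtime a 99.95% commitment allows for an entire year.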