At first glance, this past week was a disaster for cloud computing PR. A problem--make that a meltdown--in an East Coast Amazon Web Services data center took hundreds of websites offline for at least a full day, and sometimes longer. It wasn't exactly a "Yay, cloud!" moment.
Although we won't know all the details until Amazon Web Services gets out of crisis mode and has an opportunity to publish a post-mortem, it seems that the problem started with a service called Elastic Block Store (EBS). Amazon's description is that "Amazon Elastic Block Store provides highly available, highly reliable storage volumes that can be attached to a running Amazon EC2 [Elastic Compute Cloud] instance and exposed as a device within the instance. Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage." In essence, EBS lets you attach a "portable" hard disk to your virtual server without needing to have it physically attached to that server.
Initially, it might appear that this could be a classic Single Point of Failure (SPoF) where EBS was the culprit. One of the problems with cloud computing today is that mere mortals have a hard time knowing all the places where a SPoF can occur in the cloud. From the outside it may appear that you've covered all the bases as far as redundancy is concerned, but it often isn't that easy. The more virtual and indirect the environment, the worse the problem gets. Let me give you an example.
Years ago when I did software development in the telecommunications business, a customer came to our company looking for a backup data connection for their options trading firm. We were glad to provide one, and things went well for several months as they rarely used the capacity for anything more than testing. Then one day the customer's primary connection on AT&T went down when a backhoe ripped through the fiber-optic cable, so they switched over to us. But our connection was down too. It turns out that we had bought capacity from AT&T -- their supposedly redundant line was going through the very same fiber as their main connection! But that wasn't visible to the customer.
Although the Amazon problem indeed seems to have started with a failure of just the EBS service in one data center, early information suggests that this resulted in a cascading, widespread failure across Amazon's data centers, caused by congestive collapse. As Amazon customers noticed that their servers were failing, they were in the dark about exactly why the failures were occurring. So they tried starting new instances, moving their data to other zones in Amazon's network, and all kinds of other activity that only added to the congestion in the network. So now the problem was not just EBS, but the traffic jam caused by people trying to get around EBS failures.
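A standard defense against exactly this kind of retry storm is exponential backoff with jitter: each client waits a randomized, exponentially growing delay before retrying, so a crowd of recovering clients doesn't hammer the service in lockstep. Here's a minimal sketch in Python; the function names and parameters are illustrative, not anything from Amazon's stack.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a failing operation with exponential backoff and full jitter.

    The randomized sleep spreads out retries from many clients so they
    don't all hit the recovering service at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Sleep a random amount, capped by an exponentially growing ceiling:
            # up to base_delay, then 2x, 4x, 8x... bounded by max_delay.
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)
```

Had clients backed off this way instead of immediately relaunching instances and copying data between zones, the congestion feeding the cascade would have been much milder.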
Despite this turbulent April shower in the cloud last week, the industry can't give up on cloud computing. As the largest provider of cloud services, Amazon was the most likely to fall victim to a problem like this. Perhaps it's an architectural problem with EBS; if so I'd expect that Amazon will determine that in the post-mortem and come up with changes or procedures to make sure the problem doesn't happen again. It doesn't make sense for most companies to be in the business of running data centers and managing PCs full of precious data that must be backed up to prevent catastrophe. Companies should be able to focus on their own lines of business and manage information, not computers. Cloud computing can help companies do that.