One of the reasons we're in this situation is that most companies have one foot in the cloud and one in the traditional data-center-and-local-device world. The more that a company controls its own infrastructure, the more it can see the dangers and bottlenecks. Moving servers, services, and data to the cloud can put the dangers out of sight and out of mind, especially when things seem to be going right for months. Yet when things go wrong, we all realize how little we know and understand about the massive cloud data centers that hold our data, and how even the experts are hard-pressed to keep up.
The result is incredibly silly statements like this one in the New York Times: "Industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control." So, do these industry analysts recommend that everyone run their own servers in their own geographically distributed buildings, served by multiple redundant high-capacity Internet connections? Perhaps that makes sense for Google or Microsoft, but not for the majority of companies.
Lack of visibility and control during cloud failures is one of the things Amazon needs to address with its customers — in the postmortem, in its documentation, in status reporting, and in monitoring tools. If it doesn't, customers' own recovery efforts are likely to make the problems worse, causing further degradation and lengthening the time it takes to recover. That appears to be what happened in this case.
Perhaps at least part of this problem is due to the AWS philosophy of Infrastructure as a Service. Amazon provides many useful services but still leaves it to the customer to decide how to provision processors and storage. Yet without tools to optimize their infrastructure, customers may be shooting in the dark. Maybe the solution is to continue the trek toward Platform as a Service, along the lines of Microsoft Azure. Yes, it requires a scary leap of faith to hand even more control to the cloud provider, especially now. But it gives the platform provider better control over the behavior of the entire platform in extreme or failure cases like this one. And as we saw last week, when things go bad they can go very, very bad indeed.