Cloud computing can be diluted into uselessness when mixed with immature technologies and poor practices.
This incident isn't even necessarily an indictment of Amazon Web Services as a whole. Large Amazon customers such as Netflix had apparently architected their use of AWS in a way that made them largely immune to any visible service disruption. You can bet, however, that Netflix has engineers on board with an understanding of AWS that 90% of the other customers couldn't begin to approach. For the future of AWS, it is in Amazon's interest to find ways for mortal-level customers to succeed without knowing Amazon's internal architecture the way Netflix does.
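One common pattern behind the kind of resilience credited to Netflix above is simple client-side failover across availability zones: if a replica in one zone fails, retry the request against a replica in another. The sketch below is a generic illustration using assumed zone names and stand-in fetch functions; it is not Netflix's or Amazon's actual implementation.

```python
# Hypothetical failover sketch: try replicas across availability zones
# in order, returning the first successful response. Zone names and
# fetch callables are illustrative assumptions, not real endpoints.

def fetch_with_failover(replicas):
    """Try each (zone, fetch_fn) pair in order; return (zone, result)
    from the first replica that succeeds.

    replicas: list of (zone_name, zero-argument callable) pairs.
    Raises RuntimeError if every replica fails.
    """
    errors = {}
    for zone, fetch in replicas:
        try:
            return zone, fetch()
        except Exception as exc:  # a real client would catch narrower errors
            errors[zone] = exc
    raise RuntimeError(f"all replicas failed: {errors}")


def broken():
    # Simulates a replica in an unavailable zone.
    raise ConnectionError("zone unavailable")

# us-east-1a is down; the call transparently succeeds from us-west-2a.
zone, data = fetch_with_failover([
    ("us-east-1a", broken),
    ("us-west-2a", lambda: "payload"),
])
print(zone, data)  # us-west-2a payload
```

The point of the pattern is that zone failure becomes an ordinary, handled error rather than an outage — which is exactly the kind of design work most AWS customers had not done.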
One of the reasons we're in this situation is that most companies have one foot in the cloud and one in the traditional data-center-and-local-device world. The more that a company controls its own infrastructure, the more it can see the dangers and bottlenecks. Moving servers, services, and data to the cloud can put the dangers out of sight and out of mind, especially when things seem to be going right for months. Yet when things go wrong, we all realize how little we know and understand about the massive cloud data centers that hold our data, and how even the experts are hard-pressed to keep up.
The result is incredibly silly statements like this one in the New York Times: "Industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control." So, do these industry analysts recommend that everyone run their own servers in their own geographically distributed buildings, served by multiple redundant high-capacity Internet connections? Perhaps that makes sense for Google or Microsoft, but not for the majority of companies.
Lack of visibility and control in cloud failures is one of the things Amazon needs to address with its customers: in the postmortem, in its documentation, in status reporting, and in monitoring tools. If it doesn't, customers are likely to make problems worse with their own recovery efforts, causing further degradation and lengthening the time it takes to recover. That appears to be what happened in this case.
Perhaps at least part of this problem is due to the AWS philosophy of Infrastructure as a Service. Amazon has provided many useful services but still leaves it to the customer to decide how to provision processors and storage. Yet without the tools to optimize their infrastructure, customers may be shooting in the dark. Maybe the solution is to continue the trek toward Platform as a Service, along the lines of Microsoft Azure. Yes, it requires a scary leap of faith to move even more control to the cloud provider, especially now. But it gives the platform provider better control over the behavior of the entire platform in extreme or failure cases like this one. And as we saw last week, when things go bad they can go very, very bad indeed.