One of the benefits often associated with cloud computing is that the level of expected availability should be higher than most private enterprise data centers could ever achieve. The top cloud service providers have poured billions into underlying infrastructure to ensure customer uptime in the event of various failures. Yet, despite all the time, money and effort put into attempting to achieve 100% uptime, it’s a virtual certainty that your cloud service provider will one day fail you. So, what can be done about this?
Two recent high-profile cloud outages, one at Amazon and one at Microsoft, prove that even the biggest and best providers are vulnerable. As an enterprise cloud customer, it's important to understand what causes outages so you can shield your company from them. Today, we're going to first look at why the Amazon Web Services (AWS) and Azure outages occurred. Then, using that knowledge, we'll discuss how organizations that were affected by the outages could have largely avoided any downtime.
One day in late February, the Internet lit up with the news that many high-profile sites hosted on AWS were no longer accessible. The outage reportedly included sites such as Netflix, Spotify, Pinterest and Buzzfeed. For many AWS customers, the outage lasted well over four hours and occurred during prime business hours for US-based organizations. In Amazon’s official summary of the service disruption, there are two interesting things to note. First, the company said the outage was caused when: “[An] authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” In other words, human error caused an outage that affected hundreds of thousands of AWS customers.
The second revelation in the summary was that the popularity and incredible growth of AWS may have contributed to the overall size and duration of the outage. Apparently, the portions of Amazon S3 that allow capacity to be removed or replaced with no end-user impact did not function as expected. The reason was as follows: “We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
Following the massive AWS outage, Microsoft had its own cloud hiccup, this time with its cloud storage services. The overnight outage reportedly affected 26 of 28 global data center regions, disrupting both storage management and the applications that relied on data stored in Azure. The root cause of this outage was determined to be “a code defect which caused degradation in the master component of the service”. In this case, the problem was purely a technical error. Yet, it still took Microsoft upwards of eight hours to identify and mitigate it.
These two recent outages show that failures, both technical and human, do occur on even the most trusted cloud service provider networks. Despite all the multi-data center redundancy and other fancy high-availability techniques, if you're putting all your eggs in one cloud basket, expect outages to occur from time to time.
To help protect your organization, it's important to diversify mission-critical applications and data across multiple, independent data center environments. In some cases, this could be a hybrid cloud approach in which a private data center and a public cloud provider share in servicing applications, or in which one stands ready to spin services up if the other fails. In other cases, it might be better to spread duties over two or more cloud providers. While managing more than one environment adds complexity, there is a burgeoning market for multi-cloud management platforms, which are gaining in popularity thanks to high-profile outages such as the ones just witnessed within the AWS and Azure clouds.
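For readers who want to see the idea in concrete terms, here is a minimal Python sketch of the failover logic behind a multi-environment setup. It is an illustration only, not how any particular multi-cloud platform works: the environment names, the example.com URLs and the `pick_environment` helper are all hypothetical, and the health check is passed in as a plain function so the selection logic can be shown without real network calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Environment:
    """One independent hosting environment (names and URLs are hypothetical)."""
    name: str
    base_url: str


def pick_environment(
    environments: Iterable[Environment],
    is_healthy: Callable[[Environment], bool],
) -> Optional[Environment]:
    """Return the first environment, in priority order, that passes its health check."""
    for env in environments:
        try:
            if is_healthy(env):
                return env
        except Exception:
            # A health check that raises is treated as a failed check.
            continue
    # Every environment failed its check.
    return None


# Assumed priority list: a primary provider and an independent standby.
ENVIRONMENTS = [
    Environment("primary-cloud", "https://app.primary.example.com"),
    Environment("standby-cloud", "https://app.standby.example.com"),
]


def simulated_check(env: Environment) -> bool:
    """Stand-in health check that pretends the primary provider is down."""
    return env.name != "primary-cloud"


chosen = pick_environment(ENVIRONMENTS, simulated_check)
print(chosen.name if chosen else "no healthy environment")
```

In practice, the health check would probe a real endpoint in each environment, and traffic would be redirected through DNS or a load balancer rather than in application code. The point of the sketch is only that the standby must be independent of the primary, so that a single provider failure cannot take down both.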