Amazon Meltdown Required Reading At Fail University
We're starting to learn more about what happened during the Great AWS Outage last month. Perhaps the fault lies not in our servers, but in ourselves.
After more than a week of post-mortem analysis, Amazon Web Services issued a detailed description of what was behind last month's meltdown, which affected hundreds of its customers. I would like to welcome the incoming class of Amazon Fail University (AFU). It's a continuing education program, because Internet technology changes quickly. Last year's lesson isn't enough to make you an expert at this year's Internet. Here is an example.
More than a decade ago, I worked at a company that maintained its own website. Nobody was all that serious about the Web at the time, and we were no different. The server sat in the corner of the office, and was connected by a 1.5 Mbps T1 line. (Hey, at least it was plugged into a UPS!) At first the office shared the line for Internet access, but we quickly ran out of bandwidth and got the Web server its own T1. But the writing was on the wall; we knew we needed more bandwidth for our growing website.
Since it was expensive to get big bandwidth to our offices, we moved the servers to a colocation site near the offices that had a 45-Mbps shared T3 line. It solved the bandwidth bottleneck to be sure, but now our server admin had to jump into the car with a box of spare parts whenever there was a problem that couldn't be solved by the colocation's onsite staff (essentially, anything more complex than rebooting the box).
As the website grew, we decided that it didn't make sense to buy and maintain more hardware ourselves. So we moved to managed hosting, where we simply leased the hardware and the provider took care of the hardware and network maintenance. They even took care of monitoring the systems and applying our Windows security patches. This seemed like Nirvana, and worked really well--until SQL Slammer hit in January 2003.
Unlike the small disruption of Amazon's outage, SQL Slammer was an Internet-wide disaster that took many websites offline for more than a day. It exploited a vulnerability in Microsoft SQL Server that Microsoft had patched more than six months earlier, yet many sites still hadn't bothered to install the patch. Some had even told our managed hosting company not to install it, because they didn't want to take the downtime of a reboot. As a result, our hosting company had to disconnect itself from the Internet and take the offending servers offline one at a time to stop the packet storm.
That 2003 incident was like AFU 101, and the course material was tough. SQL Slammer showed us that the actions (or inactions) of our neighbors could seriously affect us, even if we weren't sharing a server with them. There is just too much interdependent infrastructure--not just hardware, but network connectivity and software as well. After the SQL Slammer incident, many companies changed their attitudes about patching and started asking more questions about how their servers were configured and connected to the Internet. Managed hosting companies also became more diligent and insistent about installing patches for their customers.