Dave Methvin

Amazon Meltdown Required Reading At Fail University

We're starting to learn more about what happened during the Great AWS Outage last month. Perhaps the fault lies not in our servers, but in ourselves.

After more than a week of post-mortem analysis, Amazon Web Services issued a detailed description of the causes of last month's meltdown, which affected hundreds of its customers. I would like to welcome the incoming class of Amazon Fail University (AFU). It's a continuing education program, because Internet technology changes quickly; last year's lesson isn't enough to make you an expert at this year's Internet. Here is an example.

More than a decade ago, I worked at a company that maintained its own website. Nobody was all that serious about the Web at the time, and we were no different. The server sat in the corner of the office, and was connected by a 1.5 Mbps T1 line. (Hey, at least it was plugged into a UPS!) At first the office shared the line for Internet access, but we quickly ran out of bandwidth and got the Web server its own T1. But the writing was on the wall; we knew we needed more bandwidth for our growing website.

Since it was expensive to get big bandwidth to our offices, we moved the servers to a nearby colocation site that had a 45-Mbps shared T3 line. That solved the bandwidth bottleneck, to be sure, but now our server admin had to jump into the car with a box of spare parts whenever there was a problem that couldn't be solved by the colocation facility's onsite staff (essentially, anything more complex than rebooting the box).

As the website grew, we decided that it didn't make sense to buy and maintain more hardware ourselves. So we moved to managed hosting, where we leased the servers and the provider handled hardware and network maintenance. They even took care of monitoring the systems and applying our Windows security patches. This seemed like Nirvana, and worked really well--until SQL Slammer hit in January 2003.

Unlike the small disruption of Amazon's outage, SQL Slammer was an Internet-wide disaster that took many websites offline for more than a day. The problem was caused by a security hole in Microsoft SQL Server, one that Microsoft had patched more than six months earlier. Yet many sites hadn't bothered to install the patch. Some had even told our managed hosting company not to install it, because they didn't want to take the downtime of a reboot. As a result, our hosting company had to disconnect itself from the Internet and take the offending servers offline one at a time to stop the packet storm.

That 2003 incident was like AFU 101, and the course material was tough. SQL Slammer showed us that the actions (or inactions) of our neighbors could seriously affect us, even if we're not sharing a server with them. There is just too much interdependent infrastructure--not just hardware, but network connectivity and software as well. After the SQL Slammer incident, many companies changed their attitudes about patching and started asking more questions about how their servers were configured and connected to the Internet. Managed hosting companies started to be more diligent and insistent about installing patches for their customers as well.
