Commentary

Thank Goodness For The Microsoft Azure Crash

Over the past weekend, Microsoft Azure was unavailable for nearly a day. Microsoft's cloud OS offering is still in beta, so the company isn't making any promises about availability or reliability at this point. However, events like this are just what a company needs to improve the product before it ships -- and before it's too late.
Problems like this are a critical part of the experience a company needs in order to deliver a reliable service. Let me give you an example. At one Web company where I worked earlier this decade, the Web hosting providers offered the typical uptime guarantees, which amounted to a promise of less than one hour of continuous downtime per month. Although there were several annoying issues, none ever rose to the level where we could claim a credit under those guarantees.

Then came Jan. 25, 2003, and the SQL Slammer worm. The worm knocked out this hosting company's entire data center for nearly 24 hours. The company was responsible for maintaining the firewalls, installing OS patches, and isolating customer loads to prevent interference, and it failed in all these categories on this day. Even under the terms of a service-level agreement written to favor the host, we were issued a refund of more than half of that month's hosting fees.

After that day, the hosting company changed completely. It enacted new procedures and practices that have made it both reliable and responsive when problems occur. More important, problems are now extremely rare. It took a reputational and financial disaster like SQL Slammer to bring the company to its senses and make it do what it needed to do. Companies like Amazon have had similar experiences with their S3 service and come out better for them.

This scenario seems to be playing out in the Microsoft Azure group. It took them a long time to figure out what was going on and fix the problem, and even now it doesn't sound like the Azure folks completely understand how the disaster played out. That is what makes this event such a great opportunity. Management should talk to the group and point out how catastrophic this would have been -- both financially and reputationally -- if Azure were a shipping product. Then they should dissect the failure and figure out how they'll avoid such problems where possible and respond more quickly when they do happen.