The faulty update effort led to corruption of Hotmail's addressing in the Internet's Domain Name System. Microsoft said what started out as "a service degradation" became a "service disruption" that took about 3.5 hours to analyze and fix.
"On Thursday, September 8th at approximately 8 p.m. PDT, Microsoft became aware of a Domain Name Service (DNS) problem causing service degradation for multiple cloud-based services," a Microsoft spokesman said in an email response to InformationWeek.
"A tool that helps balance network traffic was being updated, and for a currently unknown reason, the update did not work correctly. As a result, the configuration was corrupted, which caused service disruption. Service restoration began at approximately 10:30 p.m. PDT, with full service restoration completed at approximately 11:30 p.m. PDT. We are continuing to review the incident," the spokesman said.
For end users, full service restoration was still to come. An update posted at 11:29 p.m. PDT on Sept. 8 by Chris Jones, author of the Inside Windows Live blog, said the Domain Name System correction had just been made and would take at least 30 minutes to propagate through the system. He said the DNS corruption may have caused problems accessing other Microsoft cloud services, including its cloud storage system, SkyDrive, and the Microsoft Office applications offered online.
The service disruption and the length of Microsoft's response prompted numerous comments from around the world, as Hotmail users noted on Inside Windows Live and at DownRightNow.com that they were all experiencing the loss of service at the same time.
Changes to software by IT staff are a major cause of outages in enterprise data centers. Cloud services, including Microsoft's Windows Azure, SkyDrive, and Hotmail, aren't immune either, even though cloud providers strive to automate as many operations as possible in ways that have been tested and proven free of human error.
Even so, the leader in cloud services, Amazon Web Services, also suffered from a common human error when, in the early morning hours of April 21, an AWS network administrator switched a communications network onto a backup network that was unequal to the task of carrying all the traffic. Many running systems in EC2 found they couldn't access their data, triggering what Amazon termed on April 29 a "remirroring storm" as data systems tried to create new, accessible copies. That choked the systems and froze the ability of Elastic Block Store and Relational Database Service to access data and keep EC2 instances supplied with fresh data.
Cloud providers of all kinds will need to build in more safeguards against faulty software upgrades and human error as they continue trying to convince enterprise data center operators to make greater use of their services.
In January, Hotmail accidentally deleted 17,335 user accounts, then restored them through backup procedures.