Microsoft's cloud services, including Xbox Live, were disrupted Thursday due to a DNS error.
Services connected to Microsoft's Windows Azure cloud suffered a disruption Thursday -- the second interruption in less than a month. Online reports indicate impacted services included Microsoft.com, Outlook.com, Office 365, and Xbox Live. Microsoft had resolved most of the problems by Thursday night, avoiding the potentially cataclysmic possibility that Xbox Live would be down when Xbox One units went on sale just after midnight Friday morning.
The disruptions began at 2:22 p.m. PT and stretched across multiple regions. Microsoft corporate vice president Scott Guthrie confirmed via Twitter that the problem did not involve Azure itself. Rather, "The problem is a DNS name server issue outside of azure." Microsoft said Thursday evening that Azure was running normally.
As of Friday morning, the Azure service dashboard showed most services were functioning as intended, though partial interruptions were plaguing compute functions in Asia, Europe, and the US. Despite the outage, Windows Azure has generally proved as reliable as its competitors, many of which have also endured widespread disruptions. Amazon, for example, suffered a major failure over Easter weekend in 2011.
Glitches that knock multiple regions offline are especially rare because Microsoft, Amazon, and other major cloud providers typically organize datacenters into clumps -- or "stamps," in Microsoft parlance -- of 1,000 servers each.
These stamps include independent power, networking, and storage infrastructure. Theoretically, this tactic stops a problem in one place from spreading to others, thus keeping things like Azure available even when problems inevitably arise.
As Guthrie's tweet implies, if a DNS failure was the culprit, Microsoft's stamps weren't part of the problem. Rather, Azure was operating as it should; customers just couldn't reach it.
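The distinction Guthrie draws, between a service that is down and a service whose name simply cannot be resolved, can be illustrated with a short sketch. The Python below is an illustrative assumption, not anything Microsoft published: the function name `diagnose`, its return labels, and the injectable `resolve` and `connect` parameters are all hypothetical, chosen only to show how a client might tell the two failure modes apart.

```python
import socket


def diagnose(host, port=443, timeout=3.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Classify reachability as 'dns_failure', 'service_unreachable', or 'ok'.

    Hypothetical helper for illustration; `resolve` and `connect` are
    injectable so the logic can be exercised without network access.
    """
    # Step 1: name resolution. If this fails, the service itself may be
    # perfectly healthy; clients just cannot find its address.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: try each resolved address until one accepts a TCP connection.
    for _family, _type, _proto, _canon, sockaddr in infos:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue  # this address did not answer; try the next one
        conn.close()
        return "ok"

    # The name resolved, but no address accepted a connection.
    return "service_unreachable"
```

Under this sketch, Thursday's incident as Guthrie described it would fall into the first category: the lookup fails before any request ever reaches Azure's servers.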
Though Azure outages are rare, Microsoft has typically been transparent when they've occurred. The company published a technical report following its most notorious disruption, the Leap Day interruption on Feb. 29, 2012. In that case, faulty security certificates incorrectly indicated that servers were failing, which triggered the cloud's governing software to transfer virtual machines inappropriately. The fact that the new VMs carried incorrect certificates themselves exacerbated the issue. Microsoft deployed a fix within 10 hours.
Another significant outage occurred at the end of October. In that case, Azure GM Mike Neil told InformationWeek's Charles Babcock this week, the disruption stemmed from a bug in the API for staging systems. Neil said Microsoft will release its full analysis of the October incident this year. When a problem occurs, Microsoft focuses on restoring operations as quickly as possible to minimize the effect on customers. More in-depth forensic determinations, such as the root cause of the problem, are saved until later.
Some businesses remain hesitant to embrace the cloud due to concerns over security and reliability. Service disruptions such as the one that happened Thursday do little to persuade these skeptics. Nonetheless, Azure and the products it supports are among Microsoft's most promising assets.
Neil told Babcock that Microsoft's cloud is gaining 1,000 customers per day. The company reported in September that its Azure-backed Office 365 products were on pace to post $1.5 billion in annual revenue. Microsoft also said this year that more than 300,000 Azure servers would support enhanced Xbox One experiences.