Microsoft has provided more details to explain the outages suffered last week by its Exchange Online and Lync Online hosted services. Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email. Many customers took to Microsoft's online forums and social media accounts to voice displeasure, not only at the service outage, but also at Microsoft's handling of the situation.
In a blog post, VP of Office 365 engineering Rajesh Jha said both outages affected Microsoft's North American data centers but that the issues were unrelated. "Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider," he wrote.
Jha said the June 23 Lync Online disruption stemmed from external network failures that caused a short loss of client connectivity in Microsoft's data centers. The connectivity problem persisted only a few minutes, but Microsoft claims the ensuing traffic spike caused networking elements to become overloaded, which led to some customers' extended service issues.
The June 24 Exchange Online disruption, meanwhile, was triggered by a periodic failure that caused a directory partition to stop responding to authentication requests. Jha said "a small set of customers" lost email access altogether, and that others -- due to another, previously unknown flaw -- experienced email delays. Jha did not divulge how many customers were directly affected by Exchange Online's root error, nor how many dealt with the wider ripple effects.
The Exchange outage was compounded by a problem in Microsoft's Service Health Dashboard publishing process. The dashboard indicated to some customers that their services were fully functional, even as those services refused to load.
Jha said Microsoft has a full understanding of the problems that caused the disruptions, and is "working on further layers of hardening" to protect against future outages. He said customers can expect a Post-Incident Report in their Service Health Dashboards. Jha promised it will contain a detailed analysis of what went wrong, how Microsoft reacted, and how the company plans to avoid similar problems going forward. Though Jha's failure to detail how many customers were affected doesn't suggest a particularly transparent tone, Microsoft has a good record for sharing technical details following a service disruption.
Though Microsoft's cloud products experience few outages, last week's problems demonstrate why service lapses can be a big concern when they occur. Microsoft, Google, and others want companies to use cloud services to handle data and applications that have traditionally been hosted and managed in-house. The big cloud players have made progress over the last year, but all it takes is one outage to make professionals reconsider whether they want essential data and services to be handled by a third party.
During Tuesday's Exchange outage, a number of customers made such concerns abundantly clear. Microsoft didn't acknowledge the problems, which started around 6:00 a.m. EDT, for several hours. Even then, communications were labored; the company relied on user forums and social media to spread the word, which, given the Service Health Dashboard problem, left some customers confused and frustrated. Some criticized the company for euphemistically calling the disruption a mere "delay" in email deliveries.
"If by 'delays' you mean 6+ hours of complete outage," wrote Twitter user JD Wallace in response to a Microsoft tweet that acknowledged some Exchange customers were "experiencing email delays."
Others complained that Microsoft was slow to estimate when service might be restored. Some customers said they waited more than an hour to talk by phone with Microsoft reps, only to be given no new information.
"Microsoft needs to work more with us. IT people are getting crazy without having [anything] to tell our users," a user with the handle JanetsyLeandro wrote in an Office 365 community forum. "We need a real update... [It's] causing a big problem to our business."
Time will tell whether the service outage affects the momentum of Exchange Online, Office 365, and other Microsoft cloud products. Was your business hit by last week's outages, and were you satisfied with Microsoft's response? Let us know in the comments.