6/30/2014
09:06 AM

Microsoft Explains Exchange Outage

Microsoft pledges to do better after frustrating customers with last week's Exchange Online and Lync Online outages.

Microsoft has provided more details to explain the outages suffered last week by its Exchange Online and Lync Online hosted services. Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email. Many customers took to Microsoft's online forums and social media accounts to voice displeasure, not only at the service outage, but also at Microsoft's handling of the situation.

In a blog post, VP of Office 365 engineering Rajesh Jha said both outages affected Microsoft's North American data centers but that the issues were unrelated. "Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider," he wrote.

Jha said the June 23 Lync Online disruption stemmed from external network failures that caused a short loss of client connectivity in Microsoft's data centers. The connectivity problem persisted only a few minutes, but Microsoft claims the ensuing traffic spike caused networking elements to become overloaded, which led to some customers' extended service issues.
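A brief reconnect surge of this kind is commonly softened on the client side with exponential backoff plus random jitter, so that clients that lost connectivity at the same moment do not all retry at the same moment. The sketch below illustrates that general technique only; Microsoft has not described how Lync clients reconnect, and the function name `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=None):
    """Return one reconnect delay in seconds per attempt ("full jitter").

    Hypothetical helper for illustration; not taken from any Microsoft client.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # The ceiling doubles with every failed attempt, up to a fixed cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the ceiling spreads clients out in time.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(6), start=1):
        print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

Because each client draws its own random delay, the retries that follow a short network loss arrive spread over a widening window rather than as a single spike.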

The June 24 Exchange Online disruption, meanwhile, stemmed from a periodic failure that caused a directory partition to stop responding to authentication requests. Jha said "a small set of customers" lost email access altogether, and that others -- due to another, previously unknown flaw -- experienced email delays. Jha did not divulge how many customers were directly affected by Exchange Online's root error, nor how many dealt with the larger ripple-out effects.

The Exchange outage was compounded by a problem in Microsoft's Service Health Dashboard publishing process. The dashboard indicated to some customers that their services were fully functional, even as those services refused to load.

Jha said Microsoft has a full understanding of the problems that caused the disruptions, and is "working on further layers of hardening" to protect against future outages. He said customers can expect a Post-Incident Report in their Service Health Dashboards. Jha promised it will contain a detailed analysis of what went wrong, how Microsoft reacted, and how the company plans to avoid similar problems going forward. Though Jha's failure to detail how many customers were affected doesn't suggest a particularly transparent tone, Microsoft has a good record for sharing technical details following a service disruption.

Though Microsoft's cloud products experience few outages, last week's problems demonstrate why service lapses can be a big concern when they occur. Microsoft, Google, and others want companies to use cloud services to handle data and applications that have traditionally been hosted and managed in-house. The big cloud players have made progress over the last year, but all it takes is one outage to make professionals reconsider whether they want essential data and services to be handled by a third party.

During Tuesday's Exchange outage, a number of customers made such concerns abundantly clear. Microsoft didn't acknowledge the problems, which started around 6:00 a.m. EDT, for several hours. Even then, communications were labored; the company relied on user forums and social media to spread the word, which, given the Service Health Dashboard problem, left some customers confused and frustrated. Some criticized the company for euphemistically calling the disruption a mere "delay" in email deliveries.

"If by 'delays' you mean 6+ hours of complete outage," wrote Twitter user JD Wallace in response to a Microsoft tweet that acknowledged some Exchange customers were "experiencing email delays."

Others complained that Microsoft was slow to estimate when service might be restored. Some customers said they waited more than an hour to talk via phone with Microsoft reps, only to be given no new information.

"Microsoft needs to work more with us. IT people are getting crazy without having [anything] to tell our users," a user with the handle JanetsyLeandro wrote in an Office 365 community forum. "We need a real update... [It's] causing a big problem to our business."

Time will tell whether the service outage affects the momentum of Exchange Online, Office 365, and other Microsoft cloud products. Was your business hit by last week's outages, and were you satisfied with Microsoft's response? Let us know in the comments.

Michael Endler joined InformationWeek as an associate editor in 2012. He previously worked in talent representation in the entertainment industry, as a freelance copywriter and photojournalist, and as a teacher. Michael earned a BA in English from Stanford University in 2005.

Comments
Li Tan,
User Rank: Ninja
7/1/2014 | 9:31:18 AM
Re: Cloud transparency, round and round it goes
This is one critical thing about cloud-based stuff. The cloud must be up and running 24x7. No outage is really tolerable, which is quite different from the old enterprise software days, when we could at least allow some maintenance window. This is a challenge for both development and operations personnel.
PaulS681,
User Rank: Ninja
6/30/2014 | 8:40:52 PM
Re: Short Network Loss Overloads Networking Elements

It's kind of mind-boggling that MS isn't prepared for spikes. We use Office 365 and all I noticed that day was Lync bouncing up and down. For whatever reason email wasn't affected much.

Charlie Babcock,
User Rank: Author
6/30/2014 | 8:25:03 PM
Cloud transparency, round and round it goes
In two cases, Microsoft's Leap Day outage in 2012 and a later outage, it was pretty forthright on the occurrence and cause. In this case, some of that transparency is going down the drain.
rradina,
User Rank: Ninja
6/30/2014 | 8:07:58 PM
Re: If a Network Doesn't Fail in a Forest...
But their explanation doesn't jibe with customer experience. The "external" network issue only lasted a few minutes and yet cascaded into an all-day outage. That sounds like a freeway with so much traffic that a 15-minute flat tire on the shoulder creates a parking lot that takes all day to dissipate. If you were in the shipping business, would you ever route deliveries on such a freeway?

I don't see any way to spin this positive unless we're still missing information such as a DDOS attack or some kind of rabid SPAM event.

One would also expect the world's largest software company whose goal is to be the world's largest cloud resource to have a plan C and probably even a plan D.  I also don't think it's unreasonable to expect that when plan A fails, a task force convenes and starts working on plan E and plan F -- possibly skipping plan C and D because they've come up with a specific response that solves the issue.

Imagine what might happen to a retailer that relied on Microsoft for credit card payments?  Why would a service provider of this claimed caliber assume e-mail is such a casual service?
pcharles09,
User Rank: Moderator
6/30/2014 | 6:14:48 PM
Re: If a Network Doesn't Fail in a Forest...
I feel bad for all the in-house IT guys/gals that had to deal with that. Where I work, there's a lot of remote users. I can imagine the headaches the internal folks dealt with.
vnewman2,
User Rank: Ninja
6/30/2014 | 2:26:01 PM
Re: If a Network Doesn't Fail in a Forest...
The communication on this issue from MSFT was poor, which heightened the frustration from the masses I think. Although I wasn't in the office that day (whew) here's the email we received from our SPAM company, MIMECAST. "Mimecast has identified that Office 365 servers may be issuing intermittent "4.3.2" deferrals for inbound messages. Mimecast services are working correctly and emails sent to these servers will continue to queue. Office 365 customers should contact Microsoft directly to report and investigate the issue." At least someone is looking out for us.
Number 6,
User Rank: Moderator
6/30/2014 | 1:54:31 PM
If a Network Doesn't Fail in a Forest...
Here's a thought. What if Microsoft really can, and does, handle most network spikes without any noticeable delays or outages? We know when something fails, but how do we know when a Plan B does work? I'm not saying that's the case, but we wouldn't know if it was, would we?
cafzali,
User Rank: Moderator
6/30/2014 | 1:43:05 PM
E-mail over cloud
I wonder how many of these situations have to occur before people stop relying on the "all in one" solution providers for productivity applications? While it's true that any e-mail server can fail, it seems that companies selling all-in-one solutions are particularly prone to failures.

People once thought of Blackberry e-mail as "rock solid," until they had a few outages lasting multiple hours at a time. Like Microsoft, their main selling point was reliability. 
Laurianne,
User Rank: Author
6/30/2014 | 12:50:37 PM
Re: Short Network Loss Overloads Networking Elements
rradina, I understand your surprise. If anyone could burst up to extra capacity when traffic spikes, you'd think it would be the likes of Microsoft. It could be an Azure success story in that case.
rradina,
User Rank: Ninja
6/30/2014 | 11:52:59 AM
Short Network Loss Overloads Networking Elements
If a minor network blip can create a traffic storm for which they aren't prepared, what happens if connectivity is lost to an entire data center for several hours?

What happens if there's nothing wrong with MS data centers but a major fiber cut renders a large geographic swath of customers unable to connect and when repaired, the "traffic storm" takes out cloud e-mail for everyone?

Are they running their network capacity that close to 100%? I'd think they'd have Hoover Dam spillway-sized pipes and be paying for the potential to burst even higher if traffic warrants it.

The services likely run on thousands of virtual servers.  From a network perspective, it sounds like they should better-segment the traffic so they can perhaps shape the traffic and contain the storm.