11/15/2013
08:00 AM

Microsoft Pins Azure Slowdown On Cloud Software Fault

Microsoft Azure GM Mike Neil explains the Oct. 29-30 slowdown and the reason behind the widespread failure.

Microsoft's Windows Azure cloud froze up momentarily for a significant fraction of customers over October 29 and October 30 due to a glitch in its API for staging systems. In an interview, Azure general manager Mike Neil discussed the disruption as he described recent Azure operations.

There was a bug in the software API for staging systems that had escaped detection in testing. That was the conclusion from his "preliminary" investigation, Neil said. His finding has been reported up the chain of command to Qi Lu, president of online services, and Satya Nadella, president of the server and tools business, as is usual in such cases, Neil said during an interview in San Francisco earlier this month. A more complete explanation based on root cause analysis will be published at a later date, he said.

In effect, whether it was a customer or Microsoft itself using the staging API, the process of swapping out one virtual IP address for another, which allows a system to go from staging into production, was periodically failing. Neil couldn't specify the circumstances that caused it to stall, but the resulting disruption affected enough customers to set off a flurry of Twitter comments.


"A simple API call was failing... It was the first service-run of an API that enables swapping of virtual IP addresses," he said. The API had been thoroughly tested and functioned as expected in tests, right up until it started failing in production. Normally, the process of swapping a patched and updated preproduction system for a production system is invisible to the systems' users: Calls that had been going to the old system are smoothly moved to the updated one. On October 29, the process became all too visible, when the API call periodically failed to recognize the swapped-in system.
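The swap described above can be sketched in a few lines. This is a minimal illustration only, not Microsoft's API: the routing table, the `swap_vips` function, the deployment names, and the documentation-range IP addresses are all hypothetical.

```python
def swap_vips(routes: dict[str, str], prod_vip: str, staging_vip: str) -> dict[str, str]:
    """Return a new routing table with the targets of two virtual IPs exchanged.

    `routes` maps a virtual IP address to the deployment that serves it.
    All names here are hypothetical; this is not the Azure API.
    """
    if prod_vip not in routes or staging_vip not in routes:
        raise KeyError("both virtual IPs must already be routed")
    swapped = dict(routes)  # copy, so the old table stays intact for rollback
    swapped[prod_vip] = routes[staging_vip]
    swapped[staging_vip] = routes[prod_vip]
    return swapped


# Before the swap, users reach app-v1 through the production address.
routes = {"203.0.113.10": "app-v1", "203.0.113.20": "app-v2-patched"}
after = swap_vips(routes, "203.0.113.10", "203.0.113.20")
```

In this sketch, users keep calling the same production address throughout; only the deployment behind it changes, which is why a successful swap is invisible to them.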

Mike Neil, Microsoft Azure General Manager

Microsoft uses the API for staging updates to Azure software itself and also makes it available to customers for staging their own production workloads. In disclosing the ailment -- it slowed or momentarily stalled some tasks, but systems did not actually stop functioning -- Neil also described some aspects of Microsoft's Azure architecture that aren't generally known.

When the trouble occurred, Microsoft concentrated not on identifying the exact nature of the problem and its fix, but on getting operations back to normal in the shortest period possible. The team's focus was on mean time to recovery, Neil said. Analysis and long-term fixes can follow the root cause analysis that's done at a later date. By October 31, everything was running normally again.

Normally a failure on Azure occurs in a limited region of the overall cloud. That's because Microsoft has organized its Azure data centers into 1,000-server cells or "stamps," as it likes to call them. Microsoft, Google, and Amazon Web Services all organize their cloud infrastructures into regions, and within each region they have one or more data centers, which are themselves divided into subregions or "availability zones." Like an Amazon availability zone, a Microsoft stamp has its own power supply and network and communications links, as well as 1,000 servers and related storage. A failure that affected one stamp theoretically would be contained within its boundaries and not affect neighboring stamps.

The failure of the API implementation was evident because it occurred across multiple regions around the world at the same time. Usually any software change to Azure is carefully phased in: first to part of a data center site, then a whole data center, then a region, and so on. But the new API program had to be implemented across regions, after early phase-in success, at roughly the same time. It was too fundamental to continued operations to be brought in piecemeal. That made the resulting slowdown and balky operations much more widespread and noticeable, not limited to a spot in one region, Neil conceded.
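The usual phase-in Neil describes can be sketched as a loop that widens the deployment one stage at a time and halts at the first failed health check. The stage names follow the article; the `phased_rollout` function and the health-check callback are hypothetical, and this is not a description of Microsoft's actual tooling.

```python
from typing import Callable, Sequence


def phased_rollout(stages: Sequence[str],
                   deploy_and_check: Callable[[str], bool]) -> list[str]:
    """Deploy stage by stage and stop at the first stage that fails its check.

    Returns the stages that were updated successfully, so a fault is
    contained to the stage where it first shows up.
    """
    completed = []
    for stage in stages:
        if not deploy_and_check(stage):
            break  # halt before the change spreads any further
        completed.append(stage)
    return completed


STAGES = ["part of a data center", "whole data center", "region", "all regions"]


def fails_at_region(stage: str) -> bool:
    """Hypothetical health check: the change misbehaves once it reaches a region."""
    return stage != "region"


done = phased_rollout(STAGES, fails_at_region)
```

A change that goes to every region at roughly the same time skips this containment, which is consistent with the multi-region impact described above.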

Azure's most dramatic outage was its Leap Day failure on February 29, 2012, when faulty security certificates convinced Azure's governing software that more and more of its servers were failing and that virtual machines needed to be transferred off them. The transfers set off more false alarms as new VMs were accompanied by improperly dated certificates, cascading the problem. It took Microsoft 2.5 hours to isolate the bug and 7.5 hours to correct it.
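The article does not say how the certificates came to be improperly dated. One common way a leap-day date bug arises, shown here purely as an assumption for illustration, is computing a one-year validity period by adding one to the year. The function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Add one to the year and keep month and day (breaks on Feb. 29)."""
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Same idea, but fall back to Feb. 28 when the target year has no Feb. 29."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here, because the following year is never a leap year.
        return start.replace(year=start.year + 1, day=28)


leap_day = date(2012, 2, 29)
valid_to = safe_one_year_later(leap_day)
```

Calling `naive_one_year_later(leap_day)` raises `ValueError`, because February 29, 2013 does not exist. The naive version works on every other day of the year, which is how a bug of this kind can pass testing.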

Other cloud providers have suffered similar mishaps. One of the most noted was Amazon's Easter weekend failure in 2011, when a night technician unplugged a main trunk network and replugged it into a smaller-scale backup network. That move set off "a re-mirroring storm" as virtual machines that could no longer access their usual data source called for a backup copy to be made.

Why can't the cloud have 100 percent uptime? "Hardware will fail. Software will have bugs. People will make mistakes," Neil said philosophically.

