Big Data. Big Decisions
InformationWeek
Special Coverage Series


Amazon's Dec. 24th Outage: A Closer Look

Amazon Web Services once again cites human error spread by automated systems for loss of load balancing at key facility Christmas Eve.

On Christmas Eve, Amazon Web Services experienced an outage at its Northern Virginia data center. In a prompt follow-up, it issued an explanation on Dec. 29, apologized to customers and said it wouldn't happen again. It was the fourth outage of the year in its most heavily trafficked data center complex.

Explanations in the press of what happened, based on the Dec. 29 statement, were relatively brief. The Wall Street Journal, for example, stated that Amazon spokesmen blamed the outage "on a developer who accidentally deleted some key data ... Amazon said the disruption affected its Elastic Load Balancing Service, which distributes incoming data from applications to be handled by different computing hardware."

To an IT manager thinking of using Amazon, that leaves as much unexplained as explained. A developer disrupted running production systems? Development and production are kept distinctly separate in enterprise data centers for exactly the reason demonstrated in the Dec. 24 outage. The developer, Amazon took pains to explain, was "one of a very small number of developers who have access to this production environment." Amazon is a large organization with many developers; how many developers had access?

The developer launched a maintenance process against the running production system that deleted the state information needed by load balancers. "Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers," the Amazon team's statement said.

The cloud promises greater efficiency than enterprise data centers because it offers both a more uniform and more automated environment. However, when something unexpected goes wrong, as Amazon customers saw in the 2011 Easter weekend "remirroring storm," automation takes over and can amplify the error. That started to happen Christmas Eve around 12:30 p.m. Pacific time.

The AWS troubleshooters spotted the error rates for API calls, but a larger underlying problem was developing out of sight. When a customer sought to modify a load balancer configuration, the Elastic Load Balancer control plane needed the state information that had been deleted. "Load balancers that were modified (by customers) were improperly configured by the control plane. This resulted in degraded performance and errors for customer applications," and the problem began to spread.

The AWS troubleshooters noticed that more load balancers were showing increased error rates and realized some sort of infection was spreading. It didn't affect newly created load balancers, only those that had been operating prior to the developer's maintenance procedure. They dug "deeply into these degraded load balancers (and) identified the missing ELB state data."

At that point, it became a containment and recovery problem. After 4.5 hours of disruption, with 6.8% of load balancers affected, the team disabled the control plane workflows that could spread the problem. Other running load balancers couldn't scale up or be modified by customers, a serious setback on the final day of the Christmas shopping season. Netflix customers who wished to spend Christmas Eve watching "It's A Wonderful Life" or "Miracle on 34th Street" found they weren't able to access the films.

"The team was able to manually recover some of the affected running load balancers" by that evening and "worked through the night to try to restore the missing ELB state data." But the initial effort went awry, consuming several more hours, and "failed to provide a usable snapshot of the data."

A second recovery attempt worked. At 2:45 a.m. Pacific Dec. 25, or more than 14 hours after the disruption started, the missing state data was re-established, but even this was a near thing. The recovery occurred "just before the data was deleted," the Amazon statement acknowledged. The troubleshooters merged the state data back into the control plane "carefully" to avoid disrupting any running load balancers. By 10:30 a.m. Pacific Dec. 25, 22 hours after its start, most load balancers were back in normal operation.

AWS continued to monitor its running load balancers closely and waited until 12:05 p.m. Pacific before announcing that operations were back to normal.
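The elapsed times quoted above can be checked with a little arithmetic. The following is only a minimal sketch: the year 2012 is an assumption (the article gives only month and day), the 12:30 p.m. start is the approximate time cited earlier, and the names `START`, `MILESTONES` and `hours_since_start` are made up for illustration.

```python
from datetime import datetime

# All times are Pacific, as reported in the article.
# Assumption: the year is 2012; the start time is approximate ("around 12:30 p.m.").
START = datetime(2012, 12, 24, 12, 30)

# Milestones taken from the article's timeline (labels are illustrative).
MILESTONES = {
    "state data restored": datetime(2012, 12, 25, 2, 45),
    "most load balancers normal": datetime(2012, 12, 25, 10, 30),
    "all-clear announced": datetime(2012, 12, 25, 12, 5),
}


def hours_since_start(moment: datetime) -> float:
    """Return the hours elapsed between the start of the disruption and `moment`."""
    return (moment - START).total_seconds() / 3600


ELAPSED = {name: hours_since_start(t) for name, t in MILESTONES.items()}

for name, hours in ELAPSED.items():
    print(f"{name}: {hours:.2f} hours after start")
```

Under those assumptions the figures line up with the article: a little over 14 hours to restore the state data, 22 hours until most load balancers were back to normal, and just under 24 hours until the all-clear.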

Amazon offered a greater degree of transparency into this event than into some previous AWS outages. Immediately after a 2011 power outage at its Dublin, Ireland, data center, Amazon officials stated the local power utility said a lightning strike had been responsible. As it turned out, the utility later reported no strike ever occurred. In its Dec. 29 statement, the human error is there for all to see, along with the fits and jerks of the response. In this explanation, Amazon's response more closely resembles the standard set by Microsoft in its explanation of its own Windows Azure Leap Day bug Feb. 29 last year.

Even so, potential cloud users will frown on the fact that some developers have some access to running EC2 production systems. "We have made a number of changes to protect the ELB service from this sort of disruption in the future," the AWS team stated.

Normally a developer has one-time access to run a process. The developer in question had a more persistent level of access, which Amazon is revoking to make each case subject to administrative approval, controlled by a change management process. "This would have prevented the ELB state data from being deleted. This is a protection we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data."
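The difference between persistent access and one-time, approved access can be illustrated with a small sketch. This is not Amazon's implementation, whose details the statement does not describe; `ChangeLog`, `ApprovalRequired`, `delete_state` and the ticket IDs are all hypothetical names invented for the example. The idea shown is simply that a destructive operation consumes one administrator-approved change ticket per run.

```python
from dataclasses import dataclass, field


class ApprovalRequired(Exception):
    """Raised when a destructive operation has no approved change ticket."""


@dataclass
class ChangeLog:
    """Hypothetical record of change tickets approved by an administrator."""

    approved: set = field(default_factory=set)

    def approve(self, ticket_id: str) -> None:
        """Mark a change ticket as approved for a single use."""
        self.approved.add(ticket_id)

    def consume(self, ticket_id: str) -> None:
        """Use up an approval; raise ApprovalRequired if none exists."""
        if ticket_id not in self.approved:
            raise ApprovalRequired(f"no approved change for {ticket_id!r}")
        # One-time access: the approval is removed once it has been used.
        self.approved.remove(ticket_id)


def delete_state(store: dict, key: str, ticket_id: str, log: ChangeLog) -> None:
    """Delete `key` from `store` only if `ticket_id` has been approved."""
    log.consume(ticket_id)
    del store[key]
```

In this sketch, calling `delete_state` a second time with the same ticket raises `ApprovalRequired` and leaves the data untouched, which is the behavior a persistent grant of access does not provide.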

So the official explanation says an unexpected event occurred but it won't occur elsewhere, due to protections already in place. The team said it could now reprogram the control plane workflows "to more thoughtfully reconcile current load balancer state" with a central service running the cloud.

Both the solution and the choice of words describing it illustrate that cloud operations are complex and service providers have attempted to think of everything. That they couldn't do so at this stage of cloud computing is illustrated by the disruption and the "more thoughtful" approach that followed it.

"We want to apologize. We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service."

In the past Amazon's explanations of what's gone wrong have been terse to the point of being difficult to comprehend. The apologetic note following a clear, technical explanation parallels the pattern set in Microsoft's Leap Day event.

Bill Laing, Microsoft's corporate VP for servers and Azure, wrote after the Leap Day incident: "We know that many of our customers were impacted by this event. We sincerely apologize for the disruption, downtime, and inconvenience this incident has caused. Rest assured that we are already hard at work using our learnings to improve Windows Azure."

The idea that cloud computing requires transparency, particularly when something goes wrong, is catching on and may yet become a standard of operation across service providers. Microsoft is moving toward offering more infrastructure as a service on top of Azure's platform as a service, a form of computing more oriented toward developers. Infrastructure as a service needs to attract enterprise workloads, and to do so, must establish enterprise trust. Amazon, despite the outage, is trying to move in that direction.


