InformationWeek
Commentary

Charles Babcock

Editor At Large, InformationWeek

Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

The snafus in the cloud, it turns out, aren't so different from those occurring in the overworked, under-automated and undocumented processes of the average data center. According to Amazon's post mortem explanation of its recent hours-long outage, the failure was apparently triggered by a human error.

If so, processes that are susceptible to human error will not be good enough if the cloud is to become a permanent platform for enterprise computing.

Amazon's recent outage, which would have been more of a disaster than it was but for the low Easter holiday traffic, resulted from a configuration error in a scheduled network update. The change was attempted in the middle of the night--at 3:47 a.m. in Northern Virginia--"as part of our normal scaling activity," according to the official explanation. That sounds as if the EC2 data center was anticipating the start of early morning activity, when big customers such as Bizo or Reddit start refreshing hundreds of websites in preparation to meet the day's earliest readers.

The primary network serving one of the four availability zones in EC2's U.S. East-1 data center needed more capacity. The attempt to provide it mistakenly shifted traffic off the primary network and onto a secondary, lower-bandwidth network used for backup purposes. This is a change that has probably been implemented correctly thousands of times. It's the kind of error an operator could make by picking the wrong choice on a menu, or by entering the name of the last network worked on instead of the one needed. In short, it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse.

However, I thought the Amazon Web Services cloud used more automated procedures than that. I thought obvious errors had been anticipated and worked through, with defenses in place. Two lines of logic, checking the operator's decision, would have halted him in his tracks. A simple network configuration error should not be the source of a monumental hit to confidence in cloud computing. But apparently it is.
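To make the "two lines of logic" concrete, here is a minimal sketch of a pre-change sanity check. It is purely illustrative: the `Network` record, its `role` and `capacity_gbps` fields, and the `check_traffic_shift` function are hypothetical names invented for this example, and nothing here reflects how Amazon's tooling actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    # Hypothetical description of a network; all fields are assumptions.
    name: str
    role: str              # "primary" or "backup"
    capacity_gbps: float


class UnsafeShiftError(Exception):
    """Raised when a proposed traffic shift fails the sanity check."""


def check_traffic_shift(target: Network, current_load_gbps: float) -> None:
    # The "two lines of logic": refuse a backup-role target, and refuse any
    # target that cannot carry the traffic being moved onto it.
    if target.role != "primary":
        raise UnsafeShiftError(f"{target.name} is a {target.role} network")
    if target.capacity_gbps < current_load_gbps:
        raise UnsafeShiftError(
            f"{target.name} cannot carry {current_load_gbps} Gbps"
        )
```

Under these assumptions, an operator who picked the low-bandwidth backup network by mistake would get an error before any traffic moved, rather than eight minutes of silence followed by a "networking event."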

What happened next is not so different from what we speculated in Cloud Takes A Hit; Amazon Must Fix EC2 a week ago, based on the cryptic postings on the Services Health Dashboard. Eight minutes after the change came the start of what the Amazon Service Health Watch dashboard described as "a networking event." The misconfiguration choked the backup network, with the result that "a large number of EBS nodes in a single EBS cluster lost connection to their replicas."

An EBS cluster is a set of servers and disks that serves as short-term storage for running workloads in a given availability zone. The preceding description doesn't sound like much of an event, but in the cloud, it triggers a massive response. Suddenly large sets of data no longer knew whether their backup copy still existed on the cluster, and a central tenet of the cluster's operation is that a backup copy is always available--in case of a hardware failure.

The networking error in itself was relatively minor and easily rectified. But the error set off a massive "re-mirroring storm," a new and valuable addition to the computing lexicon's already long list of disaster terms. So many Elastic Block Store volumes were trying to find disk space on which to recreate themselves that when they failed to find it, they aggressively tried again, tying up disk operations in the zone. You get the picture.
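The arithmetic of such a storm can be sketched with a toy model. The function below is an illustration only, not a description of EBS internals: it assumes every volume's request for space fails, and it counts how many requests pile up when volumes retry on every time step. The `backoff` option, which doubles the wait after each failure, is an assumption added for contrast and is not something described in Amazon's account as quoted here.

```python
def count_attempts(volumes: int, ticks: int, backoff: bool) -> int:
    """Count placement requests sent while no free space exists.

    Each volume keeps retrying for `ticks` time steps. Without backoff it
    retries on every tick; with backoff it doubles its wait after each
    failure (1, 2, 4, ... ticks).
    """
    total = 0
    for _ in range(volumes):
        next_try, wait = 0, 1
        for tick in range(ticks):
            if tick < next_try:
                continue        # still waiting before the next attempt
            total += 1          # request fails: the cluster has no free space
            if backoff:
                next_try = tick + wait
                wait *= 2
            else:
                next_try = tick + 1
    return total


# 1,000 stranded volumes over 64 time steps (both figures are made up).
print(count_attempts(1000, 64, backoff=False))  # 64000
print(count_attempts(1000, 64, backoff=True))   # 7000
```

In this made-up scenario, aggressive retrying generates roughly nine times as many failed requests as the patient version, which is the sense in which the retries themselves, rather than the original network error, tie up the zone.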
