Amazon Cloud Outage Cleanup Hits Software Error - InformationWeek
Cloud // Infrastructure as a Service
09:38 PM
Connect Directly

Amazon Cloud Outage Cleanup Hits Software Error

Software error accidentally eliminated data snapshots taken just before Sunday's outage; may complicate some customers' recovery.

Slideshow: Amazon's Case For Enterprise Cloud Computing
Slideshow: Amazon's Case For Enterprise Cloud Computing
(click image for larger view and for full slideshow)
An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2's European data center.

Amazon dashboard notices of the problem indicate most of the data was recoverable but it's not clear whether that happened in every instance.

The error appears to have been discovered as the Dublin data center was in the process of helping customers recover from a power outage that struck both the primary and backup supplies of electricity to the Dublin, Ireland data center, known as EU-West-1, Sunday.

The Elastic Block Store cleanup process eliminates unused snapshots of an application's data. The snapshots are created as temporary backup copies during an application's run, in case data is lost during processing. When a system failure occurs, such as the power outage the day before, they serve as a nearly up-to-date source for rebuilding data.

Neither the software error or its malfunction Monday were directly related to the power outage the day before. Primary power was lost due to a lightning strike on a transformer and a resulting fire. Backup power did not kick in because the control system for the backup generators was knocked out by the strike as well. But discovery of an error in cleanup systems the next day means the difficulties some customers are experiencing to recover their systems have been compounded. It's possible that one or more customers expecting to rely on the snapshots will learn that the backup data was inadvertently eliminated.

At 3:11 p.m. Pacific time Monday, Amazon notified European customers via its Service Health Dashboard that blocks of snapshot data "were incorrectly deleted" by EBS service cleanup software. "The root cause was a software error ..." the Service Health Dashboard notice said.

As the EBS cleanup system did its work, the error caused references to some blocks "to be missed during reference counting process," which prompted the EBS snapshot management system to conclude that those blocks weren't being used and should be deleted. In many cases, other snapshots of the same data were saved. Those snapshots may be either older or younger versions than those deleted. If older, that means recent data was lost in the deleted snapshot.

"We've addressed the error in the EBS snapshot system to prevent it from recurring," the dashboard said.

Nearly 12 hours later, at 2:53 a.m. Pacific time Tuesday, AWS was able to report: "We are continuing to make steady progress in delivering recovery snapshots to affected customers accounts. We will continue to post updates here."

It was not clear when the process would be completed. At 8:06 a.m. Pacific time Tuesday, it reported over half of the volumes that had been in an inconsistent state had been recovered. Customers still needed to run a file-checking system to verify that files recovered were in the format expected by the application.

Amazon's compute service in its U.S. East-1 northern Virginia data center was reported about 7:30 p.m. Pacific time Monday to have momentary connectivity problems, followed by a momentary disconnection of its relational database service. Both were functioning without connectivity issues shortly after 8 p.m. Pacific.

InformationWeek Analytics has published a report on backing up VM disk files and building a resilient infrastructure that can tolerate hardware and software failures. After all, what's the point of constructing a virtualized infrastructure without a plan to keep systems up and running in case of a glitch--or outright disaster? Download the report now. (Free registration required.)

Comment  | 
Print  | 
More Insights
Threaded  |  Newest First  |  Oldest First
How Enterprises Are Attacking the IT Security Enterprise
How Enterprises Are Attacking the IT Security Enterprise
To learn more about what organizations are doing to tackle attacks and threats we surveyed a group of 300 IT and infosec professionals to find out what their biggest IT security challenges are and what they're doing to defend against today's threats. Download the report to see what they're saying.
Register for InformationWeek Newsletters
White Papers
Current Issue
2017 State of the Cloud Report
As the use of public cloud becomes a given, IT leaders must navigate the transition and advocate for management tools or architectures that allow them to realize the benefits they seek. Download this report to explore the issues and how to best leverage the cloud moving forward.
Twitter Feed
InformationWeek Radio
Archived InformationWeek Radio
Join us for a roundup of the top stories on for the week of November 6, 2016. We'll be talking with the editors and correspondents who brought you the top stories of the week to get the "story behind the story."
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.
Flash Poll