09:38 PM

Amazon Cloud Outage Cleanup Hits Software Error

Software error accidentally eliminated data snapshots taken just before Sunday's outage; may complicate some customers' recovery.

An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2's European data center.

Amazon's dashboard notices about the problem indicate that most of the data was recoverable, but it's not clear whether recovery succeeded in every instance.

The error appears to have been discovered while the Dublin, Ireland, data center, known as EU-West-1, was helping customers recover from a Sunday power outage that knocked out both its primary and backup supplies of electricity.

The Elastic Block Store cleanup process eliminates unused snapshots of an application's data. The snapshots are created as temporary backup copies during an application's run, in case data is lost during processing. When a system failure occurs, such as Sunday's power outage, they serve as a nearly up-to-date source for rebuilding data.

Neither the software error nor its malfunction Monday was directly related to the power outage the day before. Primary power was lost due to a lightning strike on a transformer and a resulting fire. Backup power did not kick in because the control system for the backup generators was knocked out by the strike as well. But the discovery of an error in cleanup systems the next day has compounded the difficulties some customers are experiencing in recovering their systems. It's possible that one or more customers expecting to rely on the snapshots will learn that the backup data was inadvertently eliminated.

At 3:11 p.m. Pacific time Monday, Amazon notified European customers via its Service Health Dashboard that blocks of snapshot data "were incorrectly deleted" by EBS service cleanup software. "The root cause was a software error ..." the Service Health Dashboard notice said.

As the EBS cleanup system did its work, the error caused references to some blocks "to be missed during reference counting process," which prompted the EBS snapshot management system to conclude that those blocks weren't being used and should be deleted. In many cases, other snapshots of the same data were saved. Those snapshots may be either older or newer than the ones deleted. If a surviving snapshot is older, the more recent data held in the deleted snapshot was lost.
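The failure mode described in the notice can be illustrated with a small sketch. This is not Amazon's code; the function names, snapshot names, and block IDs below are hypothetical, and the `skipped` parameter is an assumed stand-in for whatever caused references to be missed. It shows only the general idea: a cleanup pass counts how many snapshots reference each block, and any block whose count comes out as zero is treated as unused.

```python
from collections import Counter


def count_references(snapshots):
    """Count how many snapshots reference each block ID."""
    counts = Counter()
    for block_ids in snapshots.values():
        # Use a set so a block listed twice in one snapshot counts once.
        counts.update(set(block_ids))
    return counts


def blocks_to_delete(all_blocks, snapshots, skipped=()):
    """Return the stored blocks that appear to be unreferenced.

    `skipped` names snapshots the counter fails to scan. It is a
    hypothetical way to simulate references being missed.
    """
    scanned = {
        name: ids for name, ids in snapshots.items() if name not in skipped
    }
    counts = count_references(scanned)
    return {block for block in all_blocks if counts[block] == 0}


# Hypothetical data: two snapshots sharing block b2, plus an orphan b4.
snapshots = {
    "snap-old": ["b1", "b2"],
    "snap-new": ["b2", "b3"],
}
all_blocks = {"b1", "b2", "b3", "b4"}

correct = blocks_to_delete(all_blocks, snapshots)
buggy = blocks_to_delete(all_blocks, snapshots, skipped={"snap-new"})
```

With a complete count, only the orphaned block `b4` is marked for deletion. When the references held by `snap-new` are missed, `b3` is marked as well, even though a live snapshot still depends on it, while `b2` survives because the older snapshot also references it.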

"We've addressed the error in the EBS snapshot system to prevent it from recurring," the dashboard said.

Nearly 12 hours later, at 2:53 a.m. Pacific time Tuesday, AWS was able to report: "We are continuing to make steady progress in delivering recovery snapshots to affected customers accounts. We will continue to post updates here."

It was not clear when the process would be completed. At 8:06 a.m. Pacific time Tuesday, AWS reported that more than half of the volumes that had been in an inconsistent state had been recovered. Customers still needed to run a file-checking system to verify that the recovered files were in the format expected by the application.

Amazon's compute service in its U.S. East-1 data center in northern Virginia was reported to have momentary connectivity problems about 7:30 p.m. Pacific time Monday, followed by a momentary disconnection of its relational database service. Both were functioning without connectivity issues shortly after 8 p.m. Pacific.

