Amazon Cloud Outage Cleanup Hits Software Error
Software error accidentally eliminated data snapshots taken just before Sunday's outage; may complicate some customers' recovery.
Amazon's dashboard notices about the problem indicate that most of the data was recoverable, but it's not clear whether recovery succeeded in every instance.
The error appears to have been discovered as Amazon was helping customers recover from a power outage that struck both the primary and backup supplies of electricity to its Dublin, Ireland, data center, known as EU-West-1, on Sunday.
The Elastic Block Store cleanup process eliminates unused snapshots of an application's data. The snapshots are created as temporary backup copies during an application's run, in case data is lost during processing. When a system failure occurs, such as the power outage the day before, they serve as a nearly up-to-date source for rebuilding data.
Neither the software error nor its malfunction Monday was directly related to the power outage the day before. Primary power was lost due to a lightning strike on a transformer and a resulting fire. Backup power did not kick in because the control system for the backup generators was knocked out by the strike as well. But the discovery of an error in the cleanup system the next day compounds the difficulties some customers are having in recovering their systems. It's possible that one or more customers expecting to rely on the snapshots will learn that the backup data was inadvertently eliminated.
At 3:11 p.m. Pacific time Monday, Amazon notified European customers via its Service Health Dashboard that blocks of snapshot data "were incorrectly deleted" by EBS service cleanup software. "The root cause was a software error ..." the Service Health Dashboard notice said.
As the EBS cleanup system did its work, the error caused references to some blocks "to be missed during reference counting process," which prompted the EBS snapshot management system to conclude that those blocks weren't being used and should be deleted. In many cases, other snapshots of the same data were saved. Those snapshots may be either older or newer than the ones deleted. If they are older, the recent data held only in the deleted snapshot was lost.
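Amazon has not published the internals of its cleanup software, so the following is only a minimal sketch of how a missed reference during reference counting can lead to a block being deleted while a snapshot still needs it. The snapshot names, block names, and the `skip` parameter that simulates the bug are all hypothetical and are not drawn from the EBS implementation.

```python
from collections import Counter


def count_references(snapshots, skip=frozenset()):
    """Count how many snapshots reference each block.

    `skip` is a hypothetical stand-in for the bug: snapshots listed
    there are passed over, so their block references are never counted.
    """
    counts = Counter()
    for snap_id, blocks in snapshots.items():
        if snap_id in skip:
            continue  # simulated error: this snapshot's references are missed
        counts.update(blocks)
    return counts


def blocks_to_delete(stored_blocks, snapshots, skip=frozenset()):
    """Return the blocks the cleanup pass believes are unused."""
    counts = count_references(snapshots, skip)
    return {block for block in stored_blocks if counts[block] == 0}


# Hypothetical data: two snapshots sharing one block, plus one orphaned block.
snapshots = {
    "snap-old": {"b1", "b2"},
    "snap-new": {"b2", "b3"},
}
stored_blocks = {"b1", "b2", "b3", "b4"}

correct = blocks_to_delete(stored_blocks, snapshots)
buggy = blocks_to_delete(stored_blocks, snapshots, skip={"snap-new"})
wrongly_deleted = buggy - correct

print("correct cleanup removes:", sorted(correct))
print("buggy cleanup removes:  ", sorted(buggy))
print("wrongly deleted:        ", sorted(wrongly_deleted))
```

In this toy case the shared block survives because the older snapshot still references it, while the block held only by the newer snapshot is removed. That mirrors the situation described above, in which an older snapshot remains but the most recent data is gone.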
"We've addressed the error in the EBS snapshot system to prevent it from recurring," the dashboard said.
Nearly 12 hours later, at 2:53 a.m. Pacific time Tuesday, AWS was able to report: "We are continuing to make steady progress in delivering recovery snapshots to affected customers accounts. We will continue to post updates here."
It was not clear when the process would be completed. At 8:06 a.m. Pacific time Tuesday, AWS reported that more than half of the volumes that had been in an inconsistent state had been recovered. Customers still needed to run a file-checking system to verify that the recovered files were in the format expected by the application.
Amazon's compute service in its U.S. East-1 northern Virginia data center was reported about 7:30 p.m. Pacific time Monday to have momentary connectivity problems, followed by a momentary disconnection of its relational database service. Both were functioning without connectivity issues shortly after 8 p.m. Pacific.