Amazon Cloud Outage Cleanup Hits Software Error
Software error accidentally eliminated data snapshots taken just before Sunday's outage; may complicate some customers' recovery.
Amazon dashboard notices of the problem indicate most of the data was recoverable but it's not clear whether that happened in every instance.
The error appears to have been discovered as Amazon was helping customers recover from a power outage Sunday that knocked out both the primary and backup electricity supplies to its Dublin, Ireland, data center, known as EU-West-1.
The Elastic Block Store cleanup process eliminates unused snapshots of an application's data. The snapshots are created as temporary backup copies during an application's run, in case data is lost during processing. When a system failure occurs, such as the power outage the day before, they serve as a nearly up-to-date source for rebuilding data.
Neither the software error nor its malfunction Monday was directly related to the power outage the day before. Primary power was lost due to a lightning strike on a transformer and a resulting fire. Backup power did not kick in because the control system for the backup generators was knocked out by the strike as well. But the discovery of an error in cleanup systems the next day means the difficulties some customers are having in recovering their systems have been compounded. It's possible that one or more customers expecting to rely on the snapshots will learn that the backup data was inadvertently eliminated.
At 3:11 p.m. Pacific time Monday, Amazon notified European customers via its Service Health Dashboard that blocks of snapshot data "were incorrectly deleted" by EBS service cleanup software. "The root cause was a software error ..." the Service Health Dashboard notice said.
As the EBS cleanup system did its work, the error caused references to some blocks "to be missed during reference counting process," which prompted the EBS snapshot management system to conclude that those blocks weren't being used and should be deleted. In many cases, other snapshots of the same data were saved. Those snapshots may be either older or newer than the ones deleted. If they are older, recent data was lost with the deleted snapshot.
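The failure mode described in the notice can be illustrated with a minimal sketch. Amazon has not published how its cleanup software works, so everything below is a hypothetical model: the snapshot names, the block IDs, and the `skip` parameter (which stands in for whatever caused references to be missed) are assumptions made for illustration only.

```python
from collections import Counter


def count_references(snapshots):
    """Count how many snapshots reference each block ID."""
    counts = Counter()
    for block_ids in snapshots.values():
        # Use a set so a block listed twice in one snapshot counts once.
        counts.update(set(block_ids))
    return counts


def find_deletable(all_blocks, snapshots, skip=()):
    """Return the blocks that appear to have no references.

    `skip` names snapshots the counter fails to scan. It is a
    hypothetical stand-in for the missed references in the notice.
    """
    scanned = {name: ids for name, ids in snapshots.items() if name not in skip}
    counts = count_references(scanned)
    # A Counter returns 0 for block IDs it never saw.
    return {block for block in all_blocks if counts[block] == 0}


# Hypothetical data: two snapshots sharing one block, plus an orphan block.
snapshots = {
    "snap-old": ["b1", "b2"],
    "snap-new": ["b2", "b3"],
}
all_blocks = {"b1", "b2", "b3", "b4"}

correct = find_deletable(all_blocks, snapshots)
faulty = find_deletable(all_blocks, snapshots, skip={"snap-new"})
wrongly_deleted = faulty - correct

print(sorted(correct))          # ['b4']
print(sorted(wrongly_deleted))  # ['b3']
```

In this model, block `b2` survives the faulty pass because the older snapshot still references it, while `b3`, referenced only by the newer snapshot, is treated as unused. That mirrors the situation in which an older snapshot remains but the most recent data is gone.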
"We've addressed the error in the EBS snapshot system to prevent it from recurring," the dashboard said.
Nearly 12 hours later, at 2:53 a.m. Pacific time Tuesday, AWS was able to report: "We are continuing to make steady progress in delivering recovery snapshots to affected customers accounts. We will continue to post updates here."
It was not clear when the process would be completed. At 8:06 a.m. Pacific time Tuesday, AWS reported that more than half of the volumes that had been in an inconsistent state had been recovered. Customers still needed to run a file-checking system to verify that the recovered files were in the format expected by the application.
At about 7:30 p.m. Pacific time Monday, Amazon's compute service in its U.S. East-1 data center in northern Virginia was reported to have momentary connectivity problems, followed by a momentary disconnection of its relational database service. Both were functioning without connectivity issues shortly after 8 p.m. Pacific.