Software error accidentally eliminated data snapshots taken just before Sunday's outage; may complicate some customers' recovery.
An error in a piece of Amazon Web Services cleanup software has deleted some customers' backup data snapshots from EC2's European data center.
Amazon dashboard notices of the problem indicate most of the data was recoverable, but it's not clear whether it was recovered in every instance.
The error appears to have been discovered while Amazon was helping customers recover from a power outage Sunday that knocked out both the primary and backup power supplies to its Dublin, Ireland, data center, known as EU-West-1.
The Elastic Block Store cleanup process eliminates unused snapshots of an application's data. The snapshots are created as temporary backup copies during an application's run, in case data is lost during processing. When a system failure occurs, such as the power outage the day before, they serve as a nearly up-to-date source for rebuilding data.
Neither the software error nor its malfunction Monday was directly related to the power outage the day before. Primary power was lost due to a lightning strike on a transformer and a resulting fire. Backup power did not kick in because the control system for the backup generators was knocked out by the strike as well. But the discovery of an error in cleanup systems the next day means the difficulties some customers are having in recovering their systems have been compounded. It's possible that one or more customers expecting to rely on the snapshots will learn that the backup data was inadvertently eliminated.
At 3:11 p.m. Pacific time Monday, Amazon notified European customers via its Service Health Dashboard that blocks of snapshot data "were incorrectly deleted" by EBS service cleanup software. "The root cause was a software error ..." the Service Health Dashboard notice said.
As the EBS cleanup system did its work, the error caused references to some blocks "to be missed during reference counting process," which prompted the EBS snapshot management system to conclude that those blocks weren't being used and should be deleted. In many cases, other snapshots of the same data were saved. Those snapshots may be either older or newer than the ones deleted. If they are older, the recent data held only in the deleted snapshot was lost.
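The failure mode described in the dashboard notice can be illustrated with a minimal sketch. This is not Amazon's code, and nothing in it reflects how EBS is actually implemented; the snapshot names, block IDs, and the `skip` parameter are hypothetical, invented only to show how a cleanup pass that misses some references will treat live blocks as unused.

```python
from collections import Counter

def count_block_references(snapshots, skip=frozenset()):
    """Count how many snapshots reference each block.

    `skip` is a hypothetical stand-in for the bug: references held by
    the named snapshots are missed during counting.
    """
    counts = Counter()
    for name, blocks in snapshots.items():
        if name in skip:
            continue  # simulated error: this snapshot's references are not counted
        counts.update(set(blocks))
    return counts

def blocks_to_delete(all_blocks, snapshots, skip=frozenset()):
    """Return the blocks the cleanup pass believes no snapshot is using."""
    counts = count_block_references(snapshots, skip)
    return {block for block in all_blocks if counts[block] == 0}

# Hypothetical data: two snapshots sharing one block, plus one orphaned block.
snapshots = {
    "snap-old": ["b1", "b2"],
    "snap-new": ["b2", "b3"],
}
all_blocks = {"b1", "b2", "b3", "b4"}

correct = blocks_to_delete(all_blocks, snapshots)
buggy = blocks_to_delete(all_blocks, snapshots, skip={"snap-new"})

print(sorted(correct))  # ['b4']
print(sorted(buggy))    # ['b3', 'b4']
```

In the buggy run, block `b3` is deleted even though the newer snapshot still needs it, while the older snapshot survives intact. That matches the situation the notice describes, in which an older copy of the data may remain but the most recent changes are gone.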
"We've addressed the error in the EBS snapshot system to prevent it from recurring," the dashboard said.
Nearly 12 hours later, at 2:53 a.m. Pacific time Tuesday, AWS was able to report: "We are continuing to make steady progress in delivering recovery snapshots to affected customers accounts. We will continue to post updates here."
It was not clear when the process would be completed. At 8:06 a.m. Pacific time Tuesday, AWS reported that more than half of the volumes that had been in an inconsistent state had been recovered. Customers still needed to run a file-checking system to verify that the recovered files were in the format expected by the application.
Separately, Amazon's compute service in its U.S. East-1 data center in northern Virginia was reported to have momentary connectivity problems about 7:30 p.m. Pacific time Monday, followed by a momentary disconnection of its relational database service. Both were functioning without connectivity issues shortly after 8 p.m. Pacific.