To avoid being caught in the outage, many Amazon customers would have needed disaster recovery capability with live data replication already in place.
Both primary and secondary power supplies were knocked out in the same lightning strike during an intense electrical storm Sunday over the city of Dublin, where Amazon operates its European zone data center. The strike caused a transformer explosion and fire in the grid of Amazon's electricity supplier; the same strike also knocked out Amazon's backup generators.
An "electric deviation" caused by the strike traveled along the power feed wires to knock out the control system that would normally have triggered backup generators in the data center, Amazon operators reported in the EC2 cloud's Service Health Dashboard for European users.
Amazon and other data center operators take precautions to protect against lightning strikes, said Indu Kodukula, CTO of SunGard Availability Services, a disaster recovery specialist firm. But a direct strike on the power supplier's transformer "is a thing you pray never happens to you," he noted.
The strike also affected a Microsoft data center powering its Business Productivity Online Suite of applications, according to DataCenterKnowledge.com, a data center operations site.
Amazon itself explained on its Service Health Dashboard: "Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them."
In response to InformationWeek inquiries, Amazon Web Services said, "We are planning to publish a post mortem with more details," much as it did after a misaligned network brought down several EC2 services in its northern Virginia data center over April's Easter weekend.
To avoid being caught in the European outage, Amazon customers would have had to take extraordinary measures to protect themselves before the incident occurred, said Kodukula.
It's still possible that the ability to fail over to a second availability zone within the data center would have saved a customer's system. Availability zones within an Amazon data center typically have different sources of power and telecommunications, allowing one to fail and others to pick up parts of its load. But not everyone has signed up for failover service to a second zone, and Amazon spokesman Drew Herdener declined to say whether secondary zones remained available in Dublin after the primary zone outage.
In the April outage in Amazon's U.S. East region, cloud services in secondary zones failed after the primary zone went down, triggering "a re-mirroring storm." In such an incident, the sudden loss of access to many users' data causes automated systems to try to duplicate the data elsewhere, tying up all available resources.
Some companies now employ a form of disaster recovery that stores a duplicate set of virtual machines at a separate site; they're started up in the event of failure at the primary site. But Kodukula said such a process takes several minutes to get systems started at the alternative site. It also results in the loss of several minutes' worth of data.
Another alternative is to set up a data replication system that feeds real-time data into the second site. If the systems at the second site are kept running continuously, they can pick up the work of the failed systems with a minimum of data loss, he said. But companies need to apply their own coordination expertise to make such a system work, and some data may still be lost.
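As a rough illustration of the trade-off Kodukula describes, the two approaches can be compared on two measures: how long it takes to get systems running at the second site, and how much recent data is lost. The short Python sketch below does that arithmetic. The class name, field names, and all figures are hypothetical, chosen only to mirror "several minutes" versus near-real-time; none come from Amazon or SunGard.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical model of a disaster recovery setup (illustrative only)."""
    name: str
    copy_interval_s: float  # seconds between data copies sent to the second site
    startup_time_s: float   # seconds needed to start serving from the second site

    def worst_case_data_loss_s(self) -> float:
        # Anything written since the last copy reached the second site is lost.
        return self.copy_interval_s

    def downtime_s(self) -> float:
        # Service is unavailable until the second site is up and serving.
        return self.startup_time_s


def describe(plan: RecoveryPlan) -> str:
    return (f"{plan.name}: up to {plan.worst_case_data_loss_s():.0f}s of data lost, "
            f"about {plan.downtime_s():.0f}s to resume service")


# Assumed figures: stored virtual machines are copied and started in minutes;
# live replication with running systems works in seconds.
stored_vms = RecoveryPlan("Stored virtual machines", copy_interval_s=300, startup_time_s=300)
live_replication = RecoveryPlan("Live replication", copy_interval_s=1, startup_time_s=5)

if __name__ == "__main__":
    for plan in (stored_vms, live_replication):
        print(describe(plan))
```

Even in this simplified model the live replication plan does not reach zero data loss, which matches Kodukula's caveat that some data may still be lost.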
SunGard and other parties are known to be working on specialized services in the cloud that will ease the problem of establishing backup systems and activating them in case of failure. But no such services have been announced yet.