August 20, 2015
7 Data Center Disasters You'll Never See Coming
Google experienced high read/write error rates and a small data loss at its Google Compute Engine data center in St. Ghislain, Belgium, Aug. 13-17, following a storm that delivered four lightning strikes on or near the data center.
Data centers, like other commercial buildings, can be protected from lightning, but Google offered no details on how its persistent-state disk equipment had been affected by the strikes, other than to say they caused power supply lapses. Emergency power kicked in as planned, but in some cases the battery backup to the disk systems did not perform as expected.
According to a summary of the incident by the Google cloud operations team posted to its Google Cloud Status page: "Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain."
The Cloud Status summary doesn't say whether the repeated strikes led to multiple failures of the power supply to the disks.
Nor did the summary say that the data center itself was struck four times, as a BBC report on the incident stated. Rather, Google said only that there were "four successive strikes on the electrical systems of a European data center."
That could mean some of the strikes were on utility power substations or telecommunications lines outside the data center, and those strikes affected the equipment in the building. Such strikes could send voltage surges into a data center. Surge protection is a routine measure at data centers, but it's possible that repeat surges or other unexpected electrical phenomena caused an equipment failure that led to a temporary power loss. In a State of Iowa data center last year, a power surge or another electrical phenomenon caused a transient voltage surge suppression box to fail, which in turn cut power to the whole data center.
In Google's situation, its summary report said: "In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk."
Any loss of data is a serious incident for a cloud service provider, and providers take extraordinary measures to prevent it. Data sets are routinely copied three times, so that a hardware failure will still leave two intact copies. But the power interruption in St. Ghislain hit some writes before they reached stable storage, and it was those uncommitted writes that accounted for the lost data.
As a way of minimizing the loss, the Google summary cited a statistic that represented the amount of persistent disk space that had been affected out of the total available in St. Ghislain -- "less than 0.000001%." That was a meaningless figure to those customers who happened to be doing frequent reads and writes with their systems at the time. A more meaningful figure would have been simply the total amount of data lost in kilobytes, megabytes, or terabytes, or the percentage of writes lost.
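A rough calculation shows why the percentage alone says little. The sketch below converts "less than 0.000001%" into an upper bound in bytes for two assumed capacities. Google did not disclose the facility's total persistent disk capacity, so the 1 PB and 100 PB figures, like the function name, are purely illustrative assumptions.

```python
from fractions import Fraction

# From the Google summary: "less than 0.000001%" of persistent disk space.
AFFECTED_PERCENT = Fraction(1, 1_000_000)


def affected_bytes_upper_bound(total_capacity_bytes: int) -> int:
    """Upper bound on affected bytes for an ASSUMED total capacity."""
    fraction = AFFECTED_PERCENT / 100  # percent -> plain fraction (1e-8)
    return int(total_capacity_bytes * fraction)


PETABYTE = 10**15  # decimal petabyte

# Hypothetical capacities; the real figure was not published.
for label, capacity in [("1 PB", PETABYTE), ("100 PB", 100 * PETABYTE)]:
    bound = affected_bytes_upper_bound(capacity)
    print(f"Assumed capacity {label}: under {bound:,} bytes affected")
```

Under these assumptions the same percentage corresponds to anything from under 10 megabytes to under 1 gigabyte, which is why an absolute figure or a share of lost writes would have told customers more.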
A lost write in some cases might mean a lost transaction, with lost revenue as well, as opposed to content data that could be called up again from snapshots or other backup storage.
Google apologized for the loss in its summary of the incident: "Google takes availability very seriously, and the durability of storage is our highest priority. We apologize to all our customers who were affected by this exceptional incident."
It added: "We have conducted a thorough analysis of the issue, in which we identified several contributory factors across the full range of our hardware and software technology stack. We are working to improve these to maximize the reliability of GCE's whole storage layer."
Google Cloud Status began reporting an increase in disk read/write error rates in customer use of its "Standard Persistent Disk" at 9:18 p.m. Pacific time Aug. 13, continuing in decreasing frequency through early Aug. 17, when the incident was declared over. Google spokesmen said the company is working to prevent a recurrence of the incident.
The mishap in some ways was reminiscent of an electrical storm that passed over Dublin, Ireland, in August 2011, knocking out data centers on which Amazon Web Services and Microsoft Azure both depended. Early statements by Amazon raised questions about the electrical utility's ability to maintain power. But the utility denied several days afterward that any of its facilities had suffered a lightning strike. There was no further clarification of the incident.