News
8/20/2015
09:06 AM

Google Loses Data: Who Says Lightning Never Strikes Twice?

In a four-strike incident, power to Google Compute Engine disks in St. Ghislain, Belgium, gets interrupted and data writes are lost.

7 Data Center Disasters You'll Never See Coming
(Click image for larger view and slideshow.)

Google experienced high read/write error rates and a small data loss at its Google Compute Engine data center in St. Ghislain, Belgium, Aug. 13-17 following a storm that delivered four lightning strikes on or near the data center.

Data centers, like other commercial buildings, can be protected from lightning, but Google offered no details as to how its persistent disk equipment had been affected by the strikes, other than to say they caused power supply lapses. Emergency power kicked in as planned, but in some cases the battery backup to the disk systems did not perform as expected.

According to a summary of the incident by the Google cloud operations team posted to its Google Cloud Status page: "Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain." 

The Cloud Status summary doesn't say whether the repeated strikes led to multiple failures of the power supply to the disks.

[ Want to learn more about another recent Google outage? See Google Cloud Outage: Virtual Networking Breakdown. ]

The summary also did not say that the data center itself was struck four times, as a BBC report on the incident claimed. Rather, Google said only that there were "four successive strikes on the electrical systems of a European data center."

That could mean some of the strikes hit utility power substations or telecommunications lines outside the data center, and those strikes affected the equipment in the building. Such strikes could send voltage surges into a data center. Surge protection is a routine measure at data centers, but it's possible that repeated surges or some other unexpected electrical phenomenon caused an equipment failure that led to a temporary power loss. In a State of Iowa data center last year, a power surge or another electrical phenomenon caused a transient voltage surge suppression box to fail, which in turn cut power to the whole data center.

(Image: monkeypics/iStockphoto)

In Google's situation, its summary report said: "In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk."

Any loss of data is a serious incident for a cloud service provider, and providers take extraordinary measures to prevent it. Data sets are routinely copied three times, so that a hardware failure still leaves two intact copies. But the power interruption at St. Ghislain caused some data writes to disk to be lost, and it was those lost writes that produced the lost data.
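The distinction matters: replication protects data that has already reached stable storage, not writes still sitting in volatile caches when the power and its battery backup give out. The toy Python sketch below illustrates that difference; it is purely illustrative, not Google's storage code, and every class and name in it is invented.

    # Illustrative sketch only -- not Google's storage code; all names are invented.
    # Three replicas protect data that has been flushed to stable storage; a write
    # still in volatile cache on every replica can be lost when repeated outages
    # drain the battery backup before a flush happens.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.durable = {}      # blocks flushed to stable storage
            self.cache = {}        # recent writes still in volatile cache

        def write(self, key, value):
            self.cache[key] = value        # new data lands in the cache first

        def flush(self):
            self.durable.update(self.cache)
            self.cache.clear()

        def power_failure(self):
            self.cache.clear()             # battery exhausted: cached writes vanish

    def replicated_write(replicas, key, value):
        for r in replicas:                 # the write goes to all three replicas,
            r.write(key, value)            # but "written" may still mean "cached"

    replicas = [Replica("disk-%d" % i) for i in range(3)]

    replicated_write(replicas, "old-block", "committed data")
    for r in replicas:
        r.flush()                          # survives anything short of losing all three disks

    replicated_write(replicas, "new-block", "recent write")
    for r in replicas:
        r.power_failure()                  # repeated strikes, batteries drained

    print(all("old-block" in r.durable for r in replicas))   # True
    print(any("new-block" in r.durable for r in replicas))   # False: permanent loss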

As a way of minimizing the apparent scale of the loss, the Google summary cited the share of persistent disk space affected out of the total available at St. Ghislain -- "less than 0.000001%." That was a meaningless figure to customers who happened to be doing frequent reads and writes with their systems at the time. A more meaningful figure would have been simply the total amount of data lost, in kilobytes, megabytes, or terabytes, or the percentage of writes lost.

In some cases, a lost write might mean a lost transaction, and lost revenue with it, as opposed to content data that could be called up again from snapshots or other backup storage.
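For customers who do rely on snapshots, cross-zone recovery can be scripted. The sketch below uses the Google API Python client for Compute Engine; it is only an outline, the project, disk, and snapshot names are placeholders, and production code would wait for each returned operation to finish before moving on.

    # Hedged sketch using the Google API Python client (googleapiclient).
    # Assumes application-default credentials; all names below are placeholders.
    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')

    PROJECT = 'my-project'         # placeholder project ID
    SRC_ZONE = 'europe-west1-b'    # zone where the affected disks lived
    DST_ZONE = 'europe-west1-c'    # unaffected sibling zone in the same region

    # 1. Snapshot the persistent disk. Snapshots are stored independently of
    #    the source disk's zone, so they survive a single-zone incident.
    compute.disks().createSnapshot(
        project=PROJECT, zone=SRC_ZONE, disk='my-disk',
        body={'name': 'my-disk-snap'}).execute()

    # 2. Once the snapshot is ready, recreate the disk in another zone.
    #    (A real script would poll the operation returned above first.)
    compute.disks().insert(
        project=PROJECT, zone=DST_ZONE,
        body={'name': 'my-disk-restored',
              'sourceSnapshot': 'global/snapshots/my-disk-snap'}).execute()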

Google apologized for the loss in its summary of the incident: "Google takes availability very seriously, and the durability of storage is our highest priority. We apologize to all our customers who were affected by this exceptional incident."

It added: "We have conducted a thorough analysis of the issue, in which we identified several contributory factors across the full range of our hardware and software technology stack. We are working to improve these to maximize the reliability of GCE's whole storage layer."

Google Cloud Status began reporting an increase in disk read/write error rates in customer use of its "Standard Persistent Disk" at 9:18 p.m. Pacific time Aug. 13, continuing with decreasing frequency through early Aug. 17, when the incident was declared over. Google spokesmen said the company is working to prevent a recurrence of the incident.

The mishap in some ways was reminiscent of an electrical storm that passed over Dublin, Ireland, in August 2011, knocking out data centers on which Amazon Web Services and Microsoft Azure both depended. Early statements by Amazon raised questions about the electrical utility's ability to maintain power. But the utility denied several days afterward that any of its facilities had suffered a lightning strike. There was no further clarification of the incident.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ... View Full Bio

Comments
JuliaNoriot, User Rank: Apprentice
8/27/2015 | 11:04:29 AM
Re: The data loss was in one of three zones available
Yeah, it also surprises me. I thought that kind of issue had already been fixed years ago, as you mentioned before.
Stratustician, User Rank: Ninja
8/21/2015 | 2:00:19 PM
Re: The data loss was in one of three zones available
I agree. You would think that if you are promoting a service that is supposed to be up almost 100% of the time, you would have proper failovers in place. The sad reality is that unless systems are spread out, replicated at another site, or can fail over quickly enough, it's almost impossible to respond to something like lightning, which is sadly unpredictable. Nonetheless, it's just another great reason organizations should use replication in case the main data site gets interrupted.
Thomas Claburn, User Rank: Author
8/20/2015 | 4:40:10 PM
Re: The data loss was in one of three zones available
It surprises me that backup systems fail to work for the one job they wait for. As I recall, there was an SF-based ISP a few years back that had a failure and its backup generators failed to work as expected. Failover switching ought to be more reliable.
babcockcw, User Rank: Apprentice
8/20/2015 | 2:46:10 PM
The data loss was in one of three zones available
To prevent data loss, Google, like other cloud providers, recommends keeping a backup system in a second zone. At St. Ghislain, it has three zones available: europe-west1-b, europe-west1-c, and europe-west1-d. The disks affected were in europe-west1-b. Drives in the c and d zones were unaffected and would have provided a quick recovery if customers had replicated their systems and data flows there.