Google Cloud Fail Points To 2 Software Bugs - InformationWeek




The Google Compute Engine cloud service suffered an 18-minute worldwide outage earlier in the week, just as supply chain vendors are finally getting comfortable with continuity in the cloud.


Google issued an apology Wednesday for its cloud outage that swept from its Asia region through its entire global network over the course of an hour earlier in the week. The entire network was down for 18 minutes.

The company's Google Compute Engine (GCE), which allows users to create and run virtual machines on the Google cloud platform, started to route inbound traffic incorrectly in its "asia-east1" region on Monday at 6:25 pm Pacific time. That resulted in dropped connections and left users unable to reconnect.

The problem stemmed from Google engineers' efforts to remove an unused GCE IP block from its network configuration and propagate the new configuration throughout its global network. Although that task had been performed many times before without incident, a snag occurred when the configuration management software found an inconsistency in the new configuration, according to Google.


Instead of making the usual fail-safe move of returning to the last known good configuration, the management software hit an unforeseen bug that caused it to remove all of the IP blocks from the new configuration. It then began to push the incomplete configuration throughout the entire global system.
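The intended fail-safe can be illustrated with a minimal Python sketch. It is not Google's code: the `NetworkConfig` type, the `is_consistent` check, and the sample IP blocks are all hypothetical, chosen only to show the idea of falling back to the last known good configuration when a new one fails validation.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical stand-in for a network configuration."""
    version: int
    ip_blocks: FrozenSet[str]


def is_consistent(candidate: NetworkConfig, required_blocks: FrozenSet[str]) -> bool:
    """Assumed check: the config is non-empty and still announces every block in use."""
    return bool(candidate.ip_blocks) and required_blocks <= candidate.ip_blocks


def select_config(candidate: NetworkConfig,
                  last_known_good: NetworkConfig,
                  required_blocks: FrozenSet[str]) -> NetworkConfig:
    """Return the candidate only if it validates; otherwise keep the last known good config."""
    if is_consistent(candidate, required_blocks):
        return candidate
    return last_known_good


# Illustrative data only; these blocks are made up.
in_use = frozenset({"10.0.0.0/24"})
good = NetworkConfig(version=1, ip_blocks=frozenset({"10.0.0.0/24", "10.0.1.0/24"}))
stripped = NetworkConfig(version=2, ip_blocks=frozenset())  # every IP block removed
trimmed = NetworkConfig(version=3, ip_blocks=frozenset({"10.0.0.0/24"}))  # unused block removed
```

In this sketch, a configuration with every block stripped out is rejected and the previous one is retained, while a configuration that only drops an unused block is accepted.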

Normally, a second safeguard ensures the system is running fine at a single site before a new configuration is pushed out to the next site. In this case, however, a second software bug defeated that safeguard, which should have kept the problem contained at one site, and allowed the bad configuration to roll out progressively through Google's entire cloud system worldwide.
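A site-by-site safeguard of the kind described above might look roughly like the following Python sketch. The function names, the callbacks, and the site labels are assumptions made for illustration; the article does not describe how Google's rollout tooling is actually built.

```python
from typing import Callable, List, Sequence


def staged_rollout(sites: Sequence[str],
                   apply_config: Callable[[str], None],
                   is_healthy: Callable[[str], bool]) -> List[str]:
    """Push a config one site at a time, halting at the first unhealthy site.

    Returns the sites that were updated and verified healthy.
    """
    verified: List[str] = []
    for site in sites:
        apply_config(site)
        if not is_healthy(site):
            # Stop here so the problem stays contained at this one site.
            break
        verified.append(site)
    return verified


# Illustrative run with hypothetical site names: "site-b" fails its health check.
touched: List[str] = []
result = staged_rollout(
    ["site-a", "site-b", "site-c"],
    apply_config=touched.append,
    is_healthy=lambda site: site != "site-b",
)
```

Here the rollout stops at the failing site, so the later site never receives the configuration. That containment is what the second bug prevented.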

By the time Google was able to resolve the rolling outage, an hour later, 95% of its cloud system was down.

The outage affected only the Google Compute Engine service. Google Cloud Storage, Google App Engine, and other Google Cloud Platform products were not affected.

"We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages," Google said in a statement. "This incident report is both longer and more detailed than usual precisely because we consider the April 11 event so important, and we want you to understand why it happened and what we are doing about it."


Over the next several weeks, Google's engineers will be working on prevention, detection and mitigation systems to develop additional safeguards, the company said.

Nonetheless, high-profile cloud outages like this come at an unfortunate time. Supply chain vendors, which have been the slowest adopters of the cloud, have finally started coming aboard over the last several years.

One of the two main concerns in making that decision was the ability of the cloud to provide continuity in its service, according to an Oracle survey cited in SupplyChainBrain.

As supply chain vendors adopted the cloud, they would typically begin with less business-critical operations, like human resources or enterprise resource planning (ERP). Eventually, more business-critical services would be added, according to the report.

In 2015, Oracle found that 80% of survey participants were running applications in the cloud, or were planning to make the move within the next 12 months, whereas the reverse was the case just three years earlier, SupplyChainBrain noted, citing the Oracle survey.

Past high-profile outages, such as the one at Amazon Web Services in September, apparently did not dissuade companies from turning to the cloud. Even after Google's latest snafu, the same may still hold true.

Dawn Kawamoto is an Associate Editor for Dark Reading, where she covers cybersecurity news and trends. She is an award-winning journalist who has written and edited technology, management, leadership, career, finance, and innovation stories for such publications as CNET's ...
