Google issued an apology Wednesday for a cloud outage earlier in the week that swept from its Asia region across its entire global network over the course of an hour. The entire network was down for 18 minutes.
The company's Google Compute Engine (GCE), which allows users to create and run virtual machines on the Google cloud platform, began routing inbound traffic incorrectly in its "asia-east1" region on Monday at 6:25 pm PST, dropping connections and leaving users unable to reconnect.
The problem stemmed from Google engineers' efforts to remove an unused GCE IP block from its network configuration and propagate the new configuration throughout its global network. Although that task had been performed many times before without incident, a snag occurred when the configuration management software found an inconsistency in the new configuration, according to Google.
Instead of the usual fail-safe move of the system returning to the last known good configuration, an unforeseen bug in the software triggered the management software to remove all of the IP blocks from the new configuration. It then began to push the incomplete configuration throughout the entire global system.
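The intended fail-safe can be sketched in a few lines. This is a hypothetical illustration of the pattern the article describes, not Google's actual tooling; the function names, the configuration shape, and the validation rule are all assumptions made up for the example.

```python
# Hypothetical sketch of a fail-safe configuration push. All names and the
# validation rule are illustrative assumptions, not Google's real system.

def validate(config):
    """Treat a config as consistent only if it still contains IP blocks."""
    return len(config["ip_blocks"]) > 0

def choose_config(candidate, last_known_good):
    """Intended behavior: if the candidate fails validation, revert to the
    last known good configuration instead of propagating the candidate."""
    if not validate(candidate):
        # Fail-safe path: fall back rather than push a bad config network-wide.
        return last_known_good
    return candidate

last_known_good = {"ip_blocks": ["203.0.113.0/24", "198.51.100.0/24"]}
candidate = {"ip_blocks": []}  # the bug stripped every IP block

chosen = choose_config(candidate, last_known_good)  # reverts to last known good
```

In the incident as described, the bug effectively bypassed this revert step, so the empty candidate configuration was propagated instead of the last known good one.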
Normally, a second safeguard ensures the system is running fine at a single site before a new configuration is pushed out to the next site. In this case, however, a second software bug that should have kept the problem contained at one site instead allowed the faulty configuration to progressively roll out across Google's entire cloud system worldwide.
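That second safeguard is a canary-style progressive rollout: deploy to one site, verify it is healthy, and only then move on. The sketch below illustrates the pattern under stated assumptions; the site names, health check, and function signatures are invented for the example and do not reflect Google's deployment system.

```python
# Hypothetical progressive-rollout guard. Site names and the health probe
# are illustrative assumptions, not Google's actual infrastructure.

def site_healthy(site, config):
    """Stand-in health probe: a site is healthy only if the config it
    received can still route traffic (i.e., has IP blocks to announce)."""
    return len(config["ip_blocks"]) > 0

def progressive_rollout(sites, config):
    """Push the config one site at a time, halting at the first unhealthy
    canary so a bad config never reaches the rest of the fleet."""
    deployed = []
    for site in sites:
        deployed.append(site)  # push the config to this site
        if not site_healthy(site, config):
            return deployed, False  # halt: do not propagate any further
    return deployed, True

sites = ["asia-east1", "us-central1", "europe-west1"]
deployed, ok = progressive_rollout(sites, {"ip_blocks": []})
# With the guard working, the rollout stops after the first site.
```

The outage Google describes corresponds to this guard failing: the unhealthy canary did not halt the rollout, so the incomplete configuration kept propagating site by site until it was global.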
By the time Google was able to resolve the rolling outage, an hour later, 95% of its cloud system was down.
Fortunately, the outage was confined to the Google Compute Engine service; it did not affect Google Cloud Storage, Google App Engine, or other Google Cloud Platform products.
"We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages," Google said in a statement. "This incident report is both longer and more detailed than usual precisely because we consider the April 11 event so important, and we want you to understand why it happened and what we are doing about it."
Over the next several weeks, Google's engineers will be working on prevention, detection and mitigation systems to develop additional safeguards, the company said.
Nonetheless, high-profile cloud outages like this come at an unfortunate time. Supply chain vendors, which have been the slowest adopters of the cloud, have finally started coming aboard over the last several years.
One of the two main concerns in making that decision was the ability of the cloud to provide continuity in its service, according to an Oracle survey cited in SupplyChainBrain.
As supply chain vendors adopted the cloud, they typically began with less business-critical operations, like human resources or enterprise resource planning (ERP), and eventually added more business-critical services, according to the report.
In 2015, Oracle found that 80% of survey participants were running applications in the cloud or planning to make the move within the next 12 months, whereas the reverse had been the case just three years earlier, SupplyChainBrain noted.
Past high-profile outages, such as the Amazon Web Services outage in September, apparently did not dissuade companies from turning to the cloud. Even after Google's latest snafu, the same may still hold true.