Google issued an apology Wednesday for a cloud outage earlier in the week that swept from its Asia region across its entire global network over the course of an hour. The entire network was down for 18 minutes.
The company's Google Compute Engine (GCE), which allows users to create and run virtual machines on the Google cloud platform, began routing inbound traffic incorrectly in its "asia-east1" region on Monday at 6:25 pm PST, dropping connections and leaving users unable to reconnect.
The problem stemmed from Google engineers' efforts to remove an unused GCE IP block from its network configuration and propagate the new configuration throughout its global network. Although that task had been performed many times before without incident, a snag occurred when the configuration management software found an inconsistency in the new configuration, according to Google.
Instead of the usual fail-safe move of the system returning to the last known good configuration, an unforeseen bug in the software triggered the management software to remove all of the IP blocks from the new configuration. It then began to push the incomplete configuration throughout the entire global system.
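The intended fail-safe can be sketched in a few lines. This is a hypothetical illustration of the pattern the article describes, not Google's actual tooling; the function names, the configuration shape, and the validation rule are all assumptions made up for the example.

```python
# Hypothetical sketch of a fail-safe configuration push. All names and the
# validation rule are illustrative assumptions, not Google's real system.

def validate(config):
    """Treat a config as consistent only if it still contains IP blocks."""
    return len(config["ip_blocks"]) > 0

def choose_config(candidate, last_known_good):
    """Intended behavior: if the candidate fails validation, revert to the
    last known good configuration instead of propagating the candidate."""
    if not validate(candidate):
        # Fail-safe path: fall back rather than push a bad config network-wide.
        return last_known_good
    return candidate

last_known_good = {"ip_blocks": ["203.0.113.0/24", "198.51.100.0/24"]}
candidate = {"ip_blocks": []}  # the bug stripped every IP block

chosen = choose_config(candidate, last_known_good)  # reverts to last known good
```

In the incident as described, the bug effectively bypassed this revert step, so the empty candidate configuration was propagated instead of the last known good one.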
Normally, a second safeguard ensures the system is running fine at a single site before a new configuration is pushed out to the next site. In this case, however, a second software bug that should have kept the problem contained at one site instead allowed the faulty configuration to progressively roll out across Google's entire cloud system worldwide.
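That second safeguard is a canary-style progressive rollout: deploy to one site, verify it is healthy, and only then move on. The sketch below illustrates the pattern under stated assumptions; the site names, health check, and function signatures are invented for the example and do not reflect Google's deployment system.

```python
# Hypothetical progressive-rollout guard. Site names and the health probe
# are illustrative assumptions, not Google's actual infrastructure.

def site_healthy(site, config):
    """Stand-in health probe: a site is healthy only if the config it
    received can still route traffic (i.e., has IP blocks to announce)."""
    return len(config["ip_blocks"]) > 0

def progressive_rollout(sites, config):
    """Push the config one site at a time, halting at the first unhealthy
    canary so a bad config never reaches the rest of the fleet."""
    deployed = []
    for site in sites:
        deployed.append(site)  # push the config to this site
        if not site_healthy(site, config):
            return deployed, False  # halt: do not propagate any further
    return deployed, True

sites = ["asia-east1", "us-central1", "europe-west1"]
deployed, ok = progressive_rollout(sites, {"ip_blocks": []})
# With the guard working, the rollout stops after the first site.
```

The outage Google describes corresponds to this guard failing: the unhealthy canary did not halt the rollout, so the incomplete configuration kept propagating site by site until it was global.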
By the time Google was able to resolve the rolling outage, an hour later, 95% of its cloud system was down.
Fortunately, the outage was confined to the Google Compute Engine service; it did not affect Google Cloud Storage, Google App Engine, or other Google Cloud Platform products.
"We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages," Google said in a statement. "This incident report is both longer and more detailed than usual precisely because we consider the April 11 event so important, and we want you to understand why it happened and what we are doing about it."
Over the next several weeks, Google's engineers will be working on prevention, detection and mitigation systems to develop additional safeguards, the company said.
Nonetheless, high-profile cloud outages like this come at an unfortunate time. Supply chain vendors, which have been the slowest adopters of the cloud, have finally started coming aboard over the last several years.
One of the two main concerns in making that decision was the ability of the cloud to provide continuity in its service, according to an Oracle survey cited in SupplyChainBrain.
As supply chain vendors adopted the cloud, they typically began with less business-critical operations, like human resources or enterprise resource planning (ERP), and eventually added more business-critical services, according to the report.
In 2015, Oracle found that 80% of survey participants were running applications in the cloud or planning to make the move within the next 12 months, whereas the reverse had been the case just three years earlier, SupplyChainBrain noted.
Past high-profile outages, such as the Amazon Web Services outage in September, apparently did not dissuade companies from turning to the cloud. Even after Google's latest snafu, the same may still hold true.