Google Cloud Outage: Virtual Networking Breakdown - InformationWeek
2/20/2015
05:36 PM

Google Cloud Outage: Virtual Networking Breakdown

Google's virtual networking software stopped providing routing updates, and customers lost their connections to the outside world.


A two-hour and 40-minute outage for Google Compute Engine that occurred late Feb. 18 and early Feb. 19 is being described by Google as a breakdown in its data center virtual networking.

The outage occurred after business hours and during the night for most US users. As a result, it was less noticed and complained about on social media than some comparable outages at Amazon Web Services or Microsoft Azure. The outage was most noticeable in Europe as the denial of external network access built up at the start of the business day on Feb. 19, at 7 a.m. GMT in London, and continued until 9:20 a.m. GMT there.

Google took some remedial actions that lessened the impact to about 15% of Compute Engine customers shortly before 1 a.m. PST (9 a.m. GMT), according to a statement about the outage on the Google Compute Engine status page.

Despite the lack of customer outcry, it was "a significant event for someone with the infrastructure investment and maturity of Google," said David Jones, a performance analyst at Dynatrace, a service that monitors hundreds of Internet services from a global network of end-user endpoints.

(Image: Etonic.net)


A significant glitch in network operations caused first a few, then many, Google Compute Engine users to lose the connection between their virtual servers and the outside world. The servers remained running, but without external connectivity the work they could perform was limited.

The Google Compute Engine status page reported that "a low level loss of outbound network connectivity" started to occur at 10:40 p.m. PST on Feb. 18. It grew in severity for an hour and 15 minutes until 11:55 p.m. PST, when, at the peak, 70% of App Engine and Compute Engine users were without external connectivity. Connectivity was restored by 1:20 a.m. PST on Feb. 19, which was 9:20 a.m. GMT in London, after the business day there was well underway.
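The times reported on the status page can be cross-checked with a few lines of date arithmetic. The sketch below is only an illustration: it assumes PST is a fixed UTC-8 offset and treats GMT as UTC, and the variable and function names are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is modeled as a fixed UTC-8 offset; GMT is treated as UTC.
PST = timezone(timedelta(hours=-8), "PST")
GMT = timezone.utc

# Times as reported on the status page, per the article.
start = datetime(2015, 2, 18, 22, 40, tzinfo=PST)    # first loss of connectivity
peak = datetime(2015, 2, 18, 23, 55, tzinfo=PST)     # 70% of users affected
restored = datetime(2015, 2, 19, 1, 20, tzinfo=PST)  # connectivity restored

buildup = peak - start    # how long the outage took to reach its peak
total = restored - start  # full length of the outage


def as_gmt(moment: datetime) -> str:
    """Format a timezone-aware time as a GMT date and time string."""
    return moment.astimezone(GMT).strftime("%Y-%m-%d %H:%M")


print("buildup:", buildup)
print("total:", total)
print("start in GMT:", as_gmt(start))
print("restored in GMT:", as_gmt(restored))
```

The totals agree with the article: an hour and 15 minutes of buildup, two hours and 40 minutes overall, and restoration at 9:20 a.m. in London.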


The outage is an example of how cloud suppliers have implemented software-defined networking and are serving as a giant test bed for virtual networking systems, even as enterprise network managers debate the technology's merits. Cloud suppliers must quickly assign a portion of a given physical network as a virtual network to customers as their virtual servers get provisioned.

In the statement about the outage on its Google Cloud Platform status page, Google said: "The majority of Google Compute Engine instances experienced traffic loss for outbound network connectivity," a development it termed "unacceptable." The statement indicated that Compute Engine was affected by the outage more or less uniformly around the world, a conclusion supported by the data collected by third-party monitoring service Dynatrace.

Google apologized for the loss of service and explained what happened this way: As the clock ticked down on the West Coast on Wednesday, Google's virtual networking software stopped updating the network with routing information. The cause of that stoppage isn't known, but the effect was that customer workloads ran for a time on the routing information cached in the network, then started to slow and lose external connectivity as routes were deleted from the cache.

Google's own Search and Gmail functions appear to have been unaffected.

"We consider Google Compute Engine’s availability over the last 24 hours to be unacceptable, and we apologise if your service was affected by this outage. Today we are completely focused on addressing the incident and its root causes, so that this problem or other hypothetical similar problems cannot recur in the future," said Google in its statement about the incident.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
pcharles09,
User Rank: Ninja
2/23/2015 | 10:07:08 PM
Re: how frequent
@jagibbons,

With all the hype about Always On & High Availability, you'd think this could never happen. But with the perfect storm, it still seems very possible.
jagibbons,
User Rank: Ninja
2/23/2015 | 2:08:46 PM
Re: how frequent
We had something similar happen to us in one of our own datacenters about two months ago. The virtual networking layer failed and it caused a number of our sites to go down for the better part of a day. It was painful, but it only affected us. Fortunately, we aren't providing services for other companies, or it could have been a very costly SLA failure.
anon1104742756,
User Rank: Apprentice
2/23/2015 | 1:11:20 PM
Re: how frequent
What risks?  All they have at risk is a monthly payment...  (and losing customers).  The risk is on the customer...  Google's philosophy has generally been redundant, cheap, low-end equipment.  They build their stuff so that they plan for the equipment to fail, but it doesn't matter because other equipment will take over.
anon1104742756,
User Rank: Apprentice
2/23/2015 | 1:00:28 PM
Re: how frequent
There were about 70 outages last year from GCE, but most were only a few minutes and didn't cover as large a %...
nasimson,
User Rank: Ninja
2/23/2015 | 10:47:59 AM
Re: how frequent
@Brian:

> However, it would be a huge problem if it was caused due to a limited
> amount of computational resources because, it could imply that the
> economics are not scaling well as the cloud continues to grow

I don't think a company like Google would have saved money by spending less on hardware resources. The risks are not worth the cost savings.
Brian.Dean,
User Rank: Ninja
2/22/2015 | 10:20:33 PM
Re: how frequent
Agreed, an outage every 5 years is not a problem, as long as it is a software error that cascaded into a blockage. However, it would be a huge problem if it was caused due to a limited amount of computational resources because, it could imply that the economics are not scaling well as the cloud continues to grow -- hopefully, it was a software error. 
nasimson,
User Rank: Ninja
2/22/2015 | 7:36:24 AM
how frequent
What's worth noting is when the last outage happened. I don't recall one in recent times. If it has happened only once in the last five years, it's not a big deal. If it also happened last quarter (I don't recall any), it should raise eyebrows.

 