Google's virtual networking software stopped providing routing updates, and customers lost their connections to the outside world.
A two-hour and 40-minute outage for Google Compute Engine that occurred late Feb. 18 and early Feb. 19 is being described by Google as a breakdown in its data center virtual networking.
The outage occurred after business hours and during the night for most US users. As a result, it drew less notice and fewer complaints on social media than some comparable outages at Amazon Web Services or Microsoft Azure. It was most noticeable in Europe, where the loss of external network access built up at the start of the business day on Feb. 19, at 7 a.m. GMT in London, and continued until 9:20 a.m. GMT.
Google took some remedial actions that lessened the impact to about 15% of Compute Engine customers shortly before 1 a.m. PST (9 a.m. GMT), according to a statement about the outage on the Google Compute Engine status page.
Despite the lack of customer outcry, it was "a significant event for someone with the infrastructure investment and maturity of Google," said David Jones, a performance analyst at Dynatrace, a service that monitors hundreds of Internet services from a global network of end-user end points.
A significant glitch in network operations caused the virtual servers of first a few, then many, Google Compute Engine users to lose the ability to connect with the outside world. The servers remained running, but their lack of external connection limited the work they could perform.
The Google Compute Engine status page reported that "a low level loss of outbound network connectivity" started to occur at 10:40 p.m. PST on Feb. 18. Severity built up for an hour and 15 minutes until 11:55 p.m. PST, when, at the outage's peak, 70% of App Engine and Compute Engine users were without external connectivity. Connectivity was restored by 1:20 a.m. PST on Feb. 19, which was 9:20 a.m. GMT, after the business day was well underway in London.
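The timeline arithmetic can be checked with a short sketch. It assumes PST is a fixed UTC-8 offset and GMT is UTC, which holds for February; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets are enough here (no daylight saving in February).
PST = timezone(timedelta(hours=-8), "PST")
GMT = timezone.utc

# Times as reported on the status page, in Pacific time.
start = datetime(2015, 2, 18, 22, 40, tzinfo=PST)
peak = datetime(2015, 2, 18, 23, 55, tzinfo=PST)
restored = datetime(2015, 2, 19, 1, 20, tzinfo=PST)

buildup = peak - start     # time from first loss to peak severity
total = restored - start   # full length of the outage

for label, ts in [("start", start), ("peak", peak), ("restored", restored)]:
    # Show each event in London time.
    print(label, ts.astimezone(GMT).strftime("%b %d %H:%M GMT"))
print("buildup:", buildup, "total:", total)
```

The buildup and total durations match the hour and 15 minutes and the two hours and 40 minutes cited in the article.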
The outage is an example of how cloud suppliers have implemented software-defined networking and are serving as a giant test bed for virtual networking systems, even as enterprise network managers debate the technology's merits. Cloud suppliers must quickly assign a portion of a given physical network as a virtual net to customers as their virtual servers get provisioned.
In the statement about the outage on its Google Cloud Platform status page, Google said: "The majority of Google Compute Engine instances experienced traffic loss for outbound network connectivity," a development it termed "unacceptable." The statement indicated that Compute Engine was affected by the outage more or less uniformly around the world, a conclusion supported by the data collected by third-party monitoring service Dynatrace.
Google apologized for the loss of service and explained what happened this way: As the clock ticked down on the West Coast on Wednesday, Google's virtual networking software stopped updating the network with routing information. The cause of that stoppage isn't known, but the effect was that customer workloads ran for a spell on the cached routing information in the network, then started to slow and lose external connectivity as routes dropped out of the cache.
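The gradual failure Google describes, in which traffic keeps flowing on cached routes and then degrades as they drop out, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Google's implementation: the per-route lifetimes, the one-minute refresh cycle, and the `simulate` function are all hypothetical.

```python
def simulate(num_routes, base_ttl, updates_stop_at, horizon):
    """Return the fraction of routes still cached at each minute.

    Hypothetical model: while the controller is healthy it refreshes
    every route each minute; route i then stays cached for
    base_ttl + i minutes, so routes expire at staggered times.
    """
    expiry = {}
    fractions = []
    for now in range(horizon):
        if now < updates_stop_at:
            # Controller still sending routing updates.
            for i in range(num_routes):
                expiry[i] = now + base_ttl + i
        # A route is usable only while its cached entry has not expired.
        live = sum(1 for i in range(num_routes) if expiry.get(i, -1) > now)
        fractions.append(live / num_routes)
    return fractions


# Updates stop at minute 5; connectivity holds briefly, then decays.
fractions = simulate(num_routes=4, base_ttl=3, updates_stop_at=5, horizon=12)
for minute, frac in enumerate(fractions):
    print(f"minute {minute:2d}: {frac:.0%} of routes reachable")
```

In this toy model, connectivity stays at 100% for two minutes after updates stop, then falls step by step to zero, mirroring the "first a few, then many" pattern customers saw.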
Google's own Search and Gmail functions appear to have been unaffected.
"We consider Google Compute Engine’s availability over the last 24 hours to be unacceptable, and we apologise if your service was affected by this outage. Today we are completely focused on addressing the incident and its root causes, so that this problem or other hypothetical similar problems cannot recur in the future," said Google in its statement about the incident.
Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...