9/2/2009
01:39 PM

Gmail Outage 'A Big Deal,' Says Google

Aiming to blunt criticism from competitors and cloud computing detractors, Google insists it's taking the Gmail outage very seriously.

Following up on Tuesday's 100-minute Gmail outage, Google engineering VP and "Site Reliability Czar" Ben Treynor published a blog post on Tuesday evening apologizing for the service downtime and elaborating on some of the steps Google is taking to prevent the situation from happening again.

"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," said Treynor. "Thus, right up front, I'd like to apologize to all of you -- today's outage was a Big Deal, and we're treating it as such."

The problem arose, Treynor explains, when Google took a small number of Gmail servers offline to perform maintenance and underestimated the change in traffic load that would place on the request routers, which send Web queries to the appropriate Gmail server. The overloaded routers began telling Google's infrastructure to send the traffic elsewhere, but the result was a cascading failure -- there simply wasn't enough router capacity to manage the traffic.
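The dynamic Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic. It assumes a hypothetical pool of routers with fixed capacities, spreads the total load evenly across whichever routers are still accepting traffic, and drops any router whose share exceeds its capacity. The function name and all numbers are invented for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (illustrative assumption only).

    capacities: requests/sec each hypothetical router can handle.
    total_load: total requests/sec, split evenly among healthy routers.
    Returns the number of healthy routers after each redistribution round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_router = total_load / len(healthy)
        # A router whose share exceeds its capacity stops taking traffic.
        survivors = [c for c in healthy if c >= per_router]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


pool = [100, 105, 110, 115, 120]
print(simulate_cascade(pool, 500))  # [5]: the pool is stable at this load
print(simulate_cascade(pool, 520))  # [5, 4, 0]: one overload takes down the rest
```

In this model a 4% rise in load is enough to tip the weakest router over, and the traffic it sheds then pushes every other router past its limit. Isolating failures, so that shed traffic cannot land on routers that are already near capacity, is one way to break that loop.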

IMAP/POP access to Gmail, however, remained unaffected because those protocols don't rely on the same Web request routers.

Google brought additional request routers online to add more capacity and is taking steps to establish stronger failure isolation at its data centers, so overloads in one place don't create capacity problems elsewhere.

"We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements -- Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," concluded Treynor.

Treynor's apology and insistence that the outage is a big deal may represent the sort of mea culpa that users have come to expect -- Amazon issued a similarly detailed apology and analysis of an Amazon Web Services failure in July 2008 -- but such shows of concern suggest that downtime is more unusual and problematic than it really is.

As Forrester analyst Sheri McLeish told InformationWeek in February following a longer Gmail outage, Gmail is no worse in terms of reliability than most other hosted services or Exchange software managed by internal IT staff.

Indeed, a white paper titled "Preventing Your Next Microsoft Exchange Outage," published last year by disaster recovery software maker appAssure, states, "Despite the complexity of Exchange environments, outages can be minimal, affecting only a few mailboxes; or, they can be substantial, bringing entire systems offline. Despite the fact that most organizations maintain service level agreements and outage recovery plans, few are able to prevent all outages; and when they do occur, the recovery window often exceeds even the most conservative planning."

Availability of 99.9% translates into between eight and nine hours of downtime per year. That budget leaves room for about five disruptions the length of Tuesday's in a single year. Downtime happens.
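The arithmetic behind that figure is easy to check. The short sketch below assumes a 365-day year and treats the 99.9% figure as a yearly average. The function names are illustrative and do not come from any Google tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_hours(availability):
    """Hours of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def outages_within_budget(availability, outage_minutes):
    """How many outages of a given length fit in the yearly downtime budget."""
    budget_minutes = annual_downtime_hours(availability) * 60
    return budget_minutes / outage_minutes


print(round(annual_downtime_hours(0.999), 2))       # 8.76 hours per year
print(round(outages_within_budget(0.999, 100), 2))  # 5.26 outages of 100 minutes
```

By this estimate, 99.9% availability allows roughly five 100-minute outages a year before the target is missed.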

Other industries that are more critical to businesses and consumers, such as power utilities, tend to do better than that, but even they may experience similar service availability problems. In Northern California last year, for example, the average PG&E customer was without power for almost seven hours.

This won't be the last Gmail outage. But that shouldn't color one's perception of cloud computing.

