Aiming to blunt criticism from competitors and cloud computing detractors, Google insists it's taking the Gmail outage very seriously.
Following up on Tuesday's 100-minute Gmail outage, Google engineering VP and "Site Reliability Czar" Ben Treynor published a blog post that evening apologizing for the service downtime and elaborating on some of the steps Google is taking to prevent the situation from happening again.
"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," said Treynor. "Thus, right up front, I'd like to apologize to all of you -- today's outage was a Big Deal, and we're treating it as such."
The problem arose, Treynor explains, when Google took a small number of Gmail servers offline to perform maintenance and underestimated the additional traffic load this would place on the request routers, which send Web queries to the appropriate Gmail server. The overloaded routers began telling Google's infrastructure to send the traffic elsewhere, but the result was a cascading failure -- there simply wasn't enough router capacity to manage the traffic.
IMAP/POP access to Gmail, however, remained unaffected because those protocols don't rely on the same Web request routers.
Google brought additional request routers online to add more capacity and is taking steps to establish stronger failure isolation at its data centers, so overloads in one place don't create capacity problems elsewhere.
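The cascading failure Treynor describes can be illustrated with a toy model. The sketch below is not Google's architecture: the function name `simulate_cascade`, the even split of traffic across routers, and the capacity figures are all assumptions chosen for illustration. It shows only how shifting load away from one overloaded router can push the remaining ones past their limits.

```python
def simulate_cascade(capacities, total_load):
    """Return the indices of routers still healthy once load settles.

    Toy model (hypothetical): traffic is split evenly across healthy
    routers. Any router whose share exceeds its capacity drops out, and
    the full load is then re-split across the routers that remain.
    """
    healthy = list(range(len(capacities)))
    while healthy:
        share = total_load / len(healthy)
        overloaded = [i for i in healthy if share > capacities[i]]
        if not overloaded:
            break  # every remaining router can carry its share
        healthy = [i for i in healthy if i not in overloaded]
    return healthy


# Illustrative numbers only: four routers, one slightly weaker than the rest.
capacities = [100, 100, 100, 90]

print(simulate_cascade(capacities, 360))  # [0, 1, 2, 3]
print(simulate_cascade(capacities, 380))  # []
```

In the second case a modest increase in load overloads one router, and redistributing its traffic overloads the other three. In this toy model, that is the kind of spread that adding router capacity and isolating failures are meant to prevent.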
"We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements -- Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," concluded Treynor.
Treynor's apology and insistence that the outage is a big deal may represent the sort of mea culpa that users have come to expect -- Amazon issued a similarly detailed apology and analysis of an Amazon Web Services failure in July 2008 -- but such shows of concern suggest that downtime is more unusual and problematic than it really is.
As Forrester analyst Sheri McLeish told InformationWeek in February following a longer Gmail outage, Gmail is no worse in terms of reliability than most other hosted services or Exchange software managed by internal IT staff.
Indeed, a white paper titled "Preventing Your Next Microsoft Exchange Outage," published last year by disaster recovery software maker appAssure, states, "Despite the complexity of Exchange environments, outages can be minimal, affecting only a few mailboxes; or, they can be substantial, bringing entire systems offline. Despite the fact that most organizations maintain service level agreements and outage recovery plans, few are able to prevent all outages; and when they do occur, the recovery window often exceeds even the most conservative planning."
Availability of 99.9% translates into between eight and nine hours of downtime per year. So five or six Gmail disruptions to match Tuesday's can be expected this year. Downtime happens.
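The arithmetic behind those figures is easy to check. The short sketch below assumes a 365-day year and treats every disruption as lasting exactly 100 minutes, like Tuesday's; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_hours(availability):
    """Hours of downtime per year permitted by a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def equivalent_outages(availability, outage_minutes):
    """Number of outages of a given length that fit in that downtime."""
    return annual_downtime_hours(availability) * 60 / outage_minutes


downtime = annual_downtime_hours(0.999)
outages = equivalent_outages(0.999, 100)

print(f"{downtime:.2f} hours of downtime per year")  # 8.76 hours of downtime per year
print(f"{outages:.1f} outages of 100 minutes each")  # 5.3 outages of 100 minutes each
```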
Other industries that are more critical to businesses and consumers, such as power utilities, tend to do better than that, but even they may experience similar service availability problems. In Northern California last year, for example, the average PG&E customer was without power for almost seven hours.
This won't be the last Gmail outage. But that shouldn't color one's perception of cloud computing.