Gmail Disruption Blamed On Storage Software Snafu

Google is promising to recover lost messages and to restore service for affected users soon.

Google on Monday evening reported that it had identified why a small percentage of Gmail users had been unable to access their messages since Sunday: buggy software.

"We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their e-mail," said Google engineering VP Ben Treynor in a blog post. "When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version."

Unstable software in one of Google's European data centers caused a shorter service disruption in 2009. In the wake of that outage, Google introduced its App Status Dashboard to give customers more visibility into its operations.

Google has promised to remedy the situation as soon as possible, but recovery is taking longer than usual. The flaw wiped out affected customers' e-mail in multiple data centers, forcing Google's engineers to restore accounts from backup tapes. Restoring data from tapes rather than from a nearby data center takes hours instead of milliseconds, Treynor explained.

As of Tuesday, the App Status Dashboard indicates that the recovery operation is ongoing. A message posted at 8:10 am says, "Google Mail service has already been restored for some users, and we expect a resolution for all users in the near future. Please note this time frame is an estimate and may change. At the moment, we are working on restoring the affected accounts. Once the restore is complete, we will start reinstating accounts and delivering messages."

Google has declined to provide an absolute number of users affected. Based on recent estimates that the Gmail user base has reached 170 million worldwide, 0.02% works out to about 34,000 users.
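
The estimate above is simple percentage arithmetic, and it can be reproduced with a short sketch. The function name and the use of Python's decimal module are illustrative choices, not anything Google published, and the 170 million figure is the outside estimate cited above rather than an official count.

```python
from decimal import Decimal


def estimate_affected_users(user_base: int, percent: str) -> int:
    """Return the number of users that `percent` percent of `user_base` represents.

    `percent` is passed as a string (for example "0.02" for 0.02%) so that
    Decimal can represent it exactly, avoiding binary floating-point rounding.
    """
    return int(Decimal(user_base) * Decimal(percent) / Decimal(100))


# Assumed inputs: an estimated 170 million Gmail users and Google's stated 0.02%.
print(estimate_affected_users(170_000_000, "0.02"))  # prints 34000
```

Because the user base is itself an estimate, the result is only a rough guide to the scale of the outage.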

Google did not immediately respond to a request for information about whether any users covered by the Google Apps for Business service level agreement will receive credit for downtime.

Update: At 12:20 pm PST, Google amended its blog post about the status of its recovery efforts.

"Data for the remaining 0.012% of affected users has been successfully restored from tapes and is now being processed," the update says. "We plan to begin moving data into mailboxes in 2 hours, and in the hours that follow users will regain access to their data. Accounts with more mail will take more time. Thanks for bearing with us."

Google also confirmed that it will issue SLA credit to the very small number of business customers affected. The company plans to publish an incident report with additional details in the next day or two.
