Global CIO: IBM's Bank Outage: Anatomy Of A Disaster
IBM personnel inadvertently triggered a 7-hour outage at Singapore's largest banking network last month by using unapproved procedures. Here's a detailed look at what went wrong.
At 2:58 a.m. on July 5, disaster struck for Singapore's largest banking network and for a small IBM support team troubleshooting a communications problem between one of the bank's storage devices and its mainframe. Here's how the bank, DBS Group, and its IT services provider, IBM, described that harrowing moment, which came after 40 hours of on-and-off attempts to fix the problem:
"[The on-site IBM engineer] replaced the cable using the same procedures as before. This caused errors that threatened data integrity. As a result, the storage system automatically stopped communicating with the mainframe computer, to protect the data. At this point, DBS banking services were disrupted."
Disrupted, indeed: for the next seven hours, until 10:00 a.m., DBS's customers were unable to access banking services at branches, at ATMs, or through online or mobile banking. One media report said DBS has about 1,000 ATMs and, early in 2008, had almost 1,000,000 online-banking customers.
Eight days later, bank CEO Piyush Gupta issued a long and deeply apologetic letter to DBS Group's customers in which he took full responsibility for the outage and the resulting customer inconvenience and loss of trust. But IBM has also been embroiled in the controversy over what happened, why it happened, and how it can be prevented in the future.
What we do know at this point is that the outage will cost DBS Group a great deal more than the damage to its customer relationships: the government agency that oversees the banking industry, the Monetary Authority of Singapore, has ordered DBS to set aside $230 million in regulatory capital as a result of the outage.
What is not known, at least publicly, is whether IBM will have to compensate DBS for the cost of the outage and/or related costs. Two separate media reports on a joint press conference held by DBS and IBM both said the companies would not answer questions about that possibility.
According to ChannelNewsAsia.com's coverage of the press conference, IBM regional general manager Cordelia Chung said that "the personnel directly involved with this incident have been removed from direct customer support activity and disciplined" and that "IBM has taken steps to enhance the training of all related personnel on the most current procedures."
And the BusinessTimes.com.sg article quoted Chung as saying, "We have also taken steps to review installations of the same storage system at other financial institutions in Singapore for whom we provide maintenance services."
(If I may inject an opinion here: if I were DBS CEO Gupta, I'd have very mixed feelings about IBM taking that preventive-maintenance approach not only on behalf of DBS's competitors but also on the back of a very troublesome incident for DBS. That's why I would guess that, while neither company would comment on whether IBM will compensate DBS for its role in the crash, IBM is going to be paying DBS very generously in either cash or extended and comprehensive additional services. On top of that, the removal of the involved employees from future dealings with customers, plus disciplinary action against them, plus Chung's very public apologies to both DBS and its customers, all seem to add up to a very straightforward admission by IBM of some level of culpability in the outage.)
The article goes on to quote Chung as describing IBM's top priority once they realized a crash had occurred: