Global CIO: IBM's Bank Outage: Anatomy Of A Disaster

IBM personnel inadvertently triggered a 7-hour outage at Singapore's largest banking network last month by using unapproved procedures. Here's a detailed look at what went wrong.

Bob Evans, Contributor

August 4, 2010

"The immediate priority was to ensure that customer data was not in any way compromised while restoring services as quickly as possible."

Both media reports said Chung apologized to both DBS and its customers for the outage.

IBM and DBS are nearing the end of a 10-year, $1.2 billion outsourcing deal covering networks and mainframes, and the local news reports said the companies would not comment on whether they expect that contract to be renewed, extended, or dropped.

DBS Group CEO Gupta's July 13 letter to customers included this summary of what triggered the crash:

"A component replacement was scheduled for 3 am, a quiet period, which is standard operating procedure. Unfortunately, while IBM was conducting this routine replacement, under the guidance of their Asia Pacific team, a procedural error inadvertently triggered a malfunction in the multiple layers of systems redundancies, which led to the outage. The IBM Asia Pacific team is the central support unit for all IBM storage systems in the region, irrespective of whether the installation has been outsourced or is being managed in-house.

"I am treating this matter with utmost priority and the full scale investigation that we initiated last week is still underway. This investigation is being done with the support of IBM’s labs in the U.S. and their engineering teams in Asia. So far, we understand from IBM that an outdated procedure was used to carry out the repair."

For a more detailed look at what went wrong, here's a summary timeline compiled by the BusinessTimes.com.sg, citing DBS and IBM as its sources:

"July 3, 11:06 a.m.: IBM software monitoring tools sent an alert message to IBM's Asia-Pacific support centre outside Singapore, signalling an instability in a communications link in the storage system connected to DBS's mainframe computer. An IBM field engineer was despatched to the DBS data centre.

"July 3, 7:50 p.m.: The engineer replaced a cable, not using the maintenance instructions on the machine, but those given by the support centre staff. Although this was done using an incorrect step, the error message ceased.

"July 4, 2:55 p.m.: The error message reappeared, this time indicating instability in the cable and associated electronic cards. The IBM engineer was despatched again to the data centre. He asked the regional IBM support centre for advice.

"July 4, 5:16 p.m.: Following instructions from the support centre staff, the engineer removed the cable for inspection and put it back using the same incorrect step. The error message ceased.

"July 4, 6:14 p.m.: The error message reappeared. Over the next 5 hours and 22 minutes, the regional IBM support centre analysed the log from the machine and recommended to the engineer that he unplug the cable and look for a bent pin. Throughout all this, the storage system was still functioning.

"July 4, 11:38 p.m.: The engineer did not find a bent pin and put the cable back. The error message persisted. The regional support centre and the engineer continued trying to uncover the problem, including unplugging the cable and putting it back, again. DBS was contacted and authorised a cable change at 2:50am, a quiet period. While waiting to replace the cable, the IBM engineer decided to inspect the cable again for defects and check that it was installed properly. He unplugged the cable, again using the incorrect procedure advised by the regional support centre staff.

"July 5, 2:58 a.m.: He replaced the cable using the same procedures as before. This caused errors that threatened data integrity. As a result, the storage system automatically stopped communicating with the mainframe computer, to protect the data. At this point, DBS banking services were disrupted."

About the Author(s)

Bob Evans

Contributor

Bob Evans is senior VP, communications, for Oracle Corp. He is a former InformationWeek editor.
