Global CIO: IBM's Bank Outage: Anatomy Of A Disaster
IBM personnel inadvertently triggered a 7-hour outage at Singapore's largest banking network last month by using unapproved procedures. Here's a detailed look at what went wrong.
"The immediate priority was to ensure that customer data was not in any way compromised while restoring services as quickly as possible."
Media reports said Chung apologized to both DBS and its customers for the outage.
IBM and DBS are nearing the end of a 10-year, $1.2 billion outsourcing deal covering networks and mainframes, and the local news reports said the companies would not comment on whether they expect that contract to be renewed, extended, or dropped.
DBS Group CEO Gupta's July 13 letter to customers included this summary of what triggered the crash:
"A component replacement was scheduled for 3 am, a quiet period, which is standard operating procedure. Unfortunately, while IBM was conducting this routine replacement, under the guidance of their Asia Pacific team, a procedural error inadvertently triggered a malfunction in the multiple layers of systems redundancies, which led to the outage. The IBM Asia Pacific team is the central support unit for all IBM storage systems in the region, irrespective of whether the installation has been outsourced or is being managed in-house.
"I am treating this matter with utmost priority and the full scale investigation that we initiated last week is still underway. This investigation is being done with the support of IBM’s labs in the U.S and their engineering teams in Asia. So far, we understand from IBM that an outdated procedure was used to carry out the repair."
For a more detailed look at what went wrong, here's a summary timeline compiled by BusinessTimes.com.sg, citing DBS and IBM as its sources:
"July 3, 11:06 a.m.: IBM software monitoring tools sent an alert message to IBM's Asia-Pacific support sentre outside Singapore, signalling an instability in a communications link in the storage system connected to DBS's mainframe computer. An IBM field engineer was despatched to the DBS data centre.
"July 3, 7:50 p.m.: The engineer replaced a cable, not using the maintenance instructions on the machine, but those given by the support centre staff. Although this was done using an incorrect step, the error message ceased.
"July 4, 2:55 p.m.: The error message reappeared, this time indicating instability in the cable and associated electronic cards. The IBM engineer was despatched again to the data centre. He asked the regional IBM support centre for advice.
"July 4, 5:16 p.m.: Following instructions for the support centre staff, the engineer removed the cable for inspection and put it back using the same incorrect step. The error message ceased.
"July 4, 6:14 p.m.: The error message reappeared. Over the next 5 hours and 22 minutes, the regional IBM support centre analysed the log from the machine and recommended to the engineer that he unplug the cable and look for a bent pin. Throughout all this, the storage system was still functioning.
"July 4, 11:38 p.m.: The engineer did not find a bent pin and put the cable back. The error message persisted. The regional support centre and the engineer continued trying to uncover the problem, including unplugging the cable and putting it back, again. DBS was contacted and authorised a cable change at 2:50am, a quiet period. While waiting to replace the cable, the IBM engineer decided to inspect the cable again for defects and check that it was installed properly. He unplugged the cable, again using the incorrect procedure advised by the regional support centre staff.
"July 5, 2:58 a.m.: He replaced the cable using the same procedures as before. This caused errors that threatened data integrity. As a result, the storage system automatically stopped communicating with the mainframe computer, to protect the data. At this point, DBS banking services were disrupted."
Bob Evans is senior VP and director of InformationWeek's Global CIO unit.
To find out more about Bob Evans, please visit his page.
For more Global CIO perspectives, check out Global CIO, or write to Bob at [email protected].