RIM did more dancing around the issues than frank sharing as it tried to explain the BlackBerry outage--leaving CIOs to speculate. And goodwill's running short.
After an almost four-day outage of RIM's BlackBerry service, RIM's co-CEOs gave a status update Thursday morning. Mike Lazaridis delivered what appeared to be a prepared statement, followed by questions, largely from the media. The way that RIM reacted to the outage will likely shape the company's fortunes for the foreseeable future. And on the key issue, the future health of RIM's network and its ability to scale, too many questions went unanswered.
Lazaridis started out with an apology and something of a promise. "You expect better of us, I expect better of us," he said. "We are, and will take every action feasibly, to minimize the risk of this happening again."
Apparently, the failure of a single switch, combined with a bonked-up backup system, had such a tremendous "ripple effect" that it caused a worldwide outage lasting days.
The question that many CIOs and CTOs are asking is this: If the architecture was planned out right, and testing occurred on a reasonably diligent basis, how exactly could that happen?
Lazaridis says that "root cause analysis" is still ongoing. In his words:
"A dual, redundant, high-capacity core switch designed to protect the core infrastructure failed." This apparently caused outages and delays in Europe, the Middle East, Africa, India, Brazil, Chile, and Argentina. "This caused a cascade failure in our system. There was a backup switch, but the backup did not function as intended and this led to a backlog of data in the system. The failure in Europe in turn overloaded systems elsewhere. When we restarted the system based in Europe, the queue processing took longer than expected."
This, in turn, caused service outages everywhere else, including the United States.
Lazaridis took pains to point out that RIM tests its systems on a regular basis. He cited a 99.97% service level over the past 18 months and promised that RIM is doing everything in its power to minimize the risk of a recurrence. Specifically, he said, RIM will work with the vendor to "correct the particular failure mode in the switch that occurred Monday," audit the infrastructure, and continue its root cause analysis.
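To put the 99.97% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes an average month of 30.44 days and, purely for illustration, a 72-hour outage; RIM did not say how it measures its service level or whether this week's outage is counted in it, so the function names and inputs below are assumptions, not RIM's method.

```python
HOURS_PER_DAY = 24
AVG_DAYS_PER_MONTH = 30.44  # assumption: average calendar month length


def total_hours(months):
    """Approximate number of hours in the given number of months."""
    return months * AVG_DAYS_PER_MONTH * HOURS_PER_DAY


def allowed_downtime_hours(service_level_pct, months):
    """Downtime (in hours) implied by a service level over a period."""
    return total_hours(months) * (1 - service_level_pct / 100)


def service_level_after_outage(months, outage_hours):
    """Service level (in percent) if outage_hours were the only downtime."""
    return 100 * (1 - outage_hours / total_hours(months))


# 99.97% over 18 months leaves a downtime budget of roughly four hours.
budget = allowed_downtime_hours(99.97, 18)
print(f"Downtime implied by 99.97% over 18 months: {budget:.1f} hours")

# Hypothetical 72-hour outage (an illustrative figure, not RIM's number).
level = service_level_after_outage(18, 72)
print(f"Service level with a 72-hour outage in 18 months: {level:.2f}%")
```

On these assumptions, a multi-day outage by itself uses up many times the downtime that a 99.97% service level allows over 18 months, which is why the figure offers CIOs limited comfort.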
When asked what vendors were involved, RIM was cagey, saying that the company had a multi-vendor infrastructure and that it was too soon to start talking about vendors.
During the call, one analyst asked how the European failure could have cascaded the way that it did, and specifically asked whether RIM only had two operating centers. Jim Balsillie, co-CEO, jumped in and said that "it happened exactly the way Mike described it." Which, of course, didn't really answer the question.
Other great questions lobbed on the call weren't answered in ways that gave listeners great confidence.
Was it definitely a hardware failure? "We don't know why it failed the way that it did," Lazaridis said. RIM seemed clear that the outage was NOT preceded by changes in hardware or software, but didn't elaborate.
And, given RIM's layoffs, it was natural for some listeners to question whether the reliability of the infrastructure had been compromised by key staff departures. After all, once the hammer starts coming down, your best folks start leaving. The answer to this, too, was "no." But "the team that manages emergency ops is a highly skilled team that manages this, this would not have affected them," isn't exactly a resounding answer as to retention policies that might prevent a mass exodus of engineers.
CIOs listening to these answers will notice what was said--and what was not. The stakes could not be higher for RIM, nor the timing worse, as InformationWeek's Fritz Nelson noted yesterday.
RIM's customers know that reliability, management tools, and security strength are what RIM has to offer enterprises now. RIM's not competing on features.
There's no doubt that more information will be forthcoming in the days to come. But for now, serious worries remain about the architecture, testing procedures, and other aspects of RIM's venerable and once-mighty data service.
Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at firstname.lastname@example.org or at @_jfeldman.