Commentary
10/13/2011
04:44 PM

RIM Outage Explanation Leaves Big Questions

RIM did more dancing around the issues than frank sharing as it tried to explain the BlackBerry outage--leaving CIOs to speculate. And goodwill's running short.

After an almost four-day outage of RIM's BlackBerry service, RIM's co-CEOs gave a status update Thursday morning. Mike Lazaridis delivered what appeared to be a prepared statement, followed by questions, largely from the media. The way that RIM reacted to the outage will likely shape the company's fortunes for the foreseeable future. And on the key question, the future health of RIM's network and its ability to scale, too many questions went unanswered.

Lazaridis started out with an apology and something of a promise. "You expect better of us, I expect better of us," he said. "We are, and will take every action feasibly, to minimize the risk of this happening again."

Apparently, one switch's failure, combined with a bonked-up backup system, had such a tremendous "ripple effect" that it caused a worldwide outage for days.

The question that many CIOs and CTOs are asking is, if the architecture was planned out right, and testing occurred on a reasonably diligent basis, how exactly could that happen?

[ For more analysis, see BlackBerry Service Outage Spells RIM Doom. ]

Lazaridis says that "root cause analysis" is still ongoing. In his words: "A dual, redundant, high-capacity core switch designed to protect the core infrastructure failed." This apparently caused outages and delays in Europe, the Middle East, Africa, India, Brazil, Chile, and Argentina. "This caused a cascade failure in our system. There was a backup switch, but the backup did not function as intended and this led to a backlog of data in the system. The failure in Europe in turn overloaded systems elsewhere. When we restarted the system based in Europe, the queue processing took longer than expected."

This, in turn, caused service outages everywhere else, including the United States.

Lazaridis took pains to point out that RIM tests systems on a regular basis. He cited a 99.97% service level over the past 18 months, and promised that RIM is doing everything in its power to aggressively minimize the risk of a recurrence. Specifically, RIM will work with the vendor to "correct the particular failure mode in the switch that occurred Monday," audit the infrastructure, and continue its root cause analysis, he said.
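
To put that 99.97% figure in perspective, the arithmetic is easy to sketch. The snippet below is a back-of-the-envelope calculation, not RIM's methodology: it assumes the service level is a plain fraction of wall-clock time and approximates 18 months as 1.5 x 365 days. The function name and constants are illustrative only.

```python
HOURS_PER_DAY = 24

# Assumption: 18 months approximated as 1.5 * 365 days.
PERIOD_DAYS = 1.5 * 365


def downtime_hours(availability_pct: float, period_days: float) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * HOURS_PER_DAY


if __name__ == "__main__":
    # Compare the cited 99.97% with the "four nines" (99.99%) benchmark.
    for pct in (99.97, 99.99):
        hours = downtime_hours(pct, PERIOD_DAYS)
        print(f"{pct}% over 18 months allows about {hours:.1f} hours of downtime")
```

Under those assumptions, 99.97% works out to roughly four hours of permitted downtime over the period, and 99.99% to a bit over one hour. The remarks quoted here do not say how RIM computes its figure, so treat this as a rough yardstick rather than a measurement.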

When asked what vendors were involved, RIM was cagey, saying that the company had a multi-vendor infrastructure and that it was too soon to start talking about vendors.

During the call, one analyst asked how the European failure could have cascaded the way that it did, and specifically asked whether RIM only had two operating centers. Jim Balsillie, co-CEO, jumped in and said that "it happened exactly the way Mike described it." Which, of course, didn't really answer the question.

Other good questions lobbed on the call weren't answered in ways that gave listeners much confidence.

Was it definitely a hardware failure? "We don't know why it failed the way that it did," Lazaridis said. RIM seemed clear that the outage was NOT preceded by changes in hardware or software, but didn't elaborate.

And, given RIM's layoffs, it was natural for some listeners to question whether the reliability of the infrastructure had been compromised by key staff departures. After all, once the hammer starts coming down, your best folks start leaving. The answer to this, too, was "no." But "the team that manages emergency ops is a highly skilled team that manages this, this would not have affected them," isn't exactly a resounding answer as to retention policies that might prevent a mass exodus of engineers.

CIOs listening to these answers will notice what was said--and what was not. The stakes could not be higher for RIM, nor the timing worse, as InformationWeek's Fritz Nelson noted yesterday.

RIM's customers know that reliability, management tools, and security strength are what RIM has to offer enterprises now. RIM's not competing on features.

There's no doubt that more information will be forthcoming in the days to come. But for now, serious worries remain about the architecture, testing procedures, and other aspects of RIM's venerable and once-mighty data service.

Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at jf@feldman.org or at @_jfeldman.

Comments
ANON1237925156805
User Rank: Apprentice
10/17/2011 | 5:16:40 PM
re: RIM Outage Explanation Leaves Big Questions
I agree. Four 9's is the minimum, especially if one's only calling card these days is a robust enterprise server aiding in ease of device administration. As Ricky Ricardo might have said, "RIM, you got some 'splaining to do."
ps2os2
User Rank: Apprentice
10/16/2011 | 3:43:51 AM
re: RIM Outage Explanation Leaves Big Questions
Not only do they pat themselves on the back, but their base pay is in the stratosphere. One has to wonder why their base pay shouldn't be tied to uptime, with anyone delivering less than four 9's shown the door with no parachute.
ANON1241263318643
User Rank: Apprentice
10/14/2011 | 7:35:10 PM
re: RIM Outage Explanation Leaves Big Questions
He identified 99.97% as a good metric. Anything less than four-9s nowadays is considered disastrous. They have to do WAAAY better than that.
sonicmetalman
User Rank: Apprentice
10/14/2011 | 6:02:47 PM
re: RIM Outage Explanation Leaves Big Questions
Amazing. A poorly executed tap dance that threw out some of the famous buzz phrases we all love to hear, "root cause analysis", "failure mode", and "redundant". Wow. How much do these guys get paid to run a sinking ship? It's almost sad to see a company that was the envy of the telecom world become such a parody of itself.
LyricalBard
User Rank: Apprentice
10/14/2011 | 5:47:22 PM
re: RIM Outage Explanation Leaves Big Questions
As an IT guy for a large company that has outsourced most of its workforce to India, I see what's obvious. Unfortunately the CIOs and CEOs of these major companies just pat themselves on their respective backsides, giving themselves multi-million dollar raises for jobs well done. They've laid off a majority of the IT staff that built and maintain their networks in favor of cheaper India laborers who are mostly right out of school. When things go wrong they continue to cast blame and fire the seasoned IT guys on the ground. If you speak up and tell the emperor he has no clothes then you get fired too.
stevegg
User Rank: Apprentice
10/14/2011 | 5:42:08 PM
re: RIM Outage Explanation Leaves Big Questions
I hear RIM is in talks to stream Netflix!
MFARNHAM000
User Rank: Apprentice
10/14/2011 | 12:11:43 PM
re: RIM Outage Explanation Leaves Big Questions
Do people still use BlackBerry? Are they even still relevant? Just one more reason for people to dump the product.
roughplums
User Rank: Apprentice
10/14/2011 | 7:58:48 AM
re: RIM Outage Explanation Leaves Big Questions
RIM's architecture is so old: three data centers (two of them next to each other) for 60-70 million users. This creates a high risk from even the most mundane failure, such as this week's Cisco switch failover. RIM should have built multiple data centers in many countries years ago. In their poor judgement they didn't weigh the benefit of reduced risk against the increase in operational cost. With the swoosh sound of departing customers, they must be realizing the stupidity of that short-sightedness. The only other, more sinister explanation for keeping data centers in Canada and the UK could be to keep their encryption software away from the reach of regulators in countries like UAE, India, etc.

Time to move on, BB was a good solution 10 years ago. Not anymore. By the way, Apple may risk a similar fate if they pump iCloud too much while trying to cram everything in their North Carolina data center.
webspinner
User Rank: Apprentice
10/14/2011 | 1:16:54 AM
re: RIM Outage Explanation Leaves Big Questions
I'm out! Welcome Windows Phone Mango Nokia! Much less TCO to worry about!
jfeldman
User Rank: Strategist
10/14/2011 | 1:11:39 AM
re: RIM Outage Explanation Leaves Big Questions
Absolutely. As we used to say about single sign on, it can be "single vulnerability."