After a botched explanation for the BlackBerry outage, RIM customers are imagining the worst. RIM must take two steps right away to regain IT's trust.
RIM's choice, during Thursday's conference call update, to stay vague about the details behind the almost four-day outage has generated lots of speculation about what, exactly, went wrong. Precise communication would have made things clearer, but RIM dropped the ball. Customers are now left to speculate about things that might be far worse than the reality.
It's easy to be sympathetic to RIM, especially if you've worked for any length of time in enterprise IT. When it comes to outages, we've all been there. The system is down, users are wondering, and it's hard to know what to do. But this type of situation is IT 101. Support handles communication, infrastructure folks handle the problem. Is that so hard?
RIM's major mistake in handling this incident was focusing on having top management in the trenches working the problem instead of communicating. RIM positioned this as heroics. During the conference call, listeners asked why it took so long to start communicating, and why co-CEO Mike Lazaridis took until Thursday morning to explain, given that the outage started on Monday. Jim Balsillie, co-CEO, jumped to his defense. "I'll speak for Mike: He was directly commanding the team, nobody's gone home since Monday."
Dude. You are talking to the most sympathetic audience in the world; don't mess it up. Again, there's not a CIO on the planet who doesn't understand how it feels to have a major outage going on. So much is vague. So much is unknown. So much pressure. You try everything. So, all you had to do was have someone who isn't working on the problem make the apology or announcements. You do have a co-CEO, don't you? And for organizations that don't, surely there's someone who is not in the trenches working on the problem.
What about those marketing folks? If you're in the trenches, someone else needs to communicate. You haven't slept. You're in no shape to communicate. And, you're so focused on solving the problem that you would NEVER waste time taking your eye off the ball to send out a press release or go on video. Of course someone else needs to do that. That's what your business partners are for, and it would be dysfunctional to expect infrastructure folks to do that. That's how IT has sensibly evolved over the last 20 years.
It's a simple mechanism. But RIM screwed it up. And that's why InformationWeek readers and others have started to speculate. After all, they had no good information.
As InformationWeek's Laurianne McLaughlin pointed out earlier today, the Twittersphere is all abuzz with complaints and cracks about RIM. (New definition of Crackberry?) But the folks on the private mailing lists that I'm on are even more abuzz.
On one private list, a CIO summed up the group's feelings well, saying that RIM's architecture "was just what was needed when there was low bandwidth and no mobile apps." Despite RIM's statements yesterday to the contrary, he went on to say, "this problem highlights how bad a single point of failure architecture is and why it should be avoided."
One CTO for an ISP observed, in the information vacuum, "it appears that their services are very regionalized with little or no redundancy in other regions. They tried to imply they have redundancy by stating they have multiple data centers around the globe, but it's obvious that each region runs through one or a few data centers in that region so they've only split the problem into smaller pieces, but haven't solved it."
And a reader comment on my story yesterday speculated that "they've laid off a majority of their IT staffs that built and maintain their networks for cheaper India laborers who are mostly right out of school. When things go wrong they continue to cast blame and fire the seasoned IT guys on the ground. If you speak up and tell the emperor he has no clothes then you get fired too." RIM brushed this type of concern aside yesterday, but stopped short of specifics. Earth to RIM: specifics are BADLY needed right now.
If RIM is going to survive this debacle, they need to do two things. First, take a course in IT 101 communications. This speculation is killing them. Second, RIM needs to be very specific about how they will be testing in the future.
Netflix regularly throws a "chaos monkey" into its software infrastructure, randomly breaking parts of it. That's the ONLY way you really know whether something is going to fail over or not.
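For anyone who wants to picture what that kind of test looks like, here is a minimal Python sketch. To be clear, this is not Netflix's tool, and it says nothing about how RIM's network is actually built. The `Node`, `Service`, and `chaos_round` names and the two-switch setup are hypothetical, made up purely to illustrate the idea: break one component at random, then check whether traffic still gets through.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical infrastructure component, such as a switch."""
    name: str
    healthy: bool = True


class Service:
    """Sends traffic to the first healthy node; later nodes act as backups."""

    def __init__(self, nodes):
        self.nodes = nodes

    def handle(self, message):
        for node in self.nodes:
            if node.healthy:
                return f"{node.name} delivered {message}"
        raise RuntimeError("outage: no healthy node left")


def chaos_round(service, rng):
    """Break one randomly chosen node and report whether the service survived."""
    victim = rng.choice(service.nodes)
    victim.healthy = False
    try:
        service.handle("test message")
        survived = True
    except RuntimeError:
        survived = False
    finally:
        victim.healthy = True  # restore the node so the next round starts clean
    return victim.name, survived


if __name__ == "__main__":
    rng = random.Random()
    service = Service([Node("primary-switch"), Node("backup-switch")])
    for _ in range(5):
        name, survived = chaos_round(service, rng)
        print(f"killed {name}: {'failover OK' if survived else 'OUTAGE'}")
```

With two nodes, every round should report a successful failover. Take the backup out of the list and the very first round reports an outage, which is exactly the kind of surprise you want to find in a drill rather than in production.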
Yesterday's bonked-up backup switch was inexcusable, and the failure would have been caught had RIM tested it. RIM needs to adopt the open communication practices of companies like Netflix. It may be operationally and culturally difficult, but RIM must do so to regain user trust.
Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Write to him at email@example.com or at @_jfeldman.