On Sept. 24, Workday's SaaS service for human resources, financial applications and payroll was down for 15 hours. That's right: not 15 minutes, not 1.5 hours, but 15 hours. When Google's Gmail goes down for 90 minutes, it's as if the world has come to an end. So it raises the question: Is 15 hours of downtime for core applications such as accounting and HR acceptable?
The day after the outage, Workday Co-CEO Aneel Bhusri posted a blog entry explaining to customers what happened, but it wasn't until this week that the ERP-focused blogosphere and twitterers began discussing the incident. In a blog post published Thursday, software consultant Michael Krigsman took pains to point out that Workday did a nice job of damage control, daring to say that the outage was actually about "a success and not a failure."
Workday gave Krigsman the phone number of marquee customer Manjit Singh, CIO of Chiquita Brands, who told Krigsman:
"Outages are never good, but they do happen. Workday's communication was fantastic: they kept us informed of the problem, steps they were taking to resolve it, and expected time to solution."
Well, glad to hear it. I've gotten to know the folks at Workday, and they're all very smart, nice, committed and hardworking people. The co-CEOs and founders, Aneel Bhusri and PeopleSoft founder/billionaire Dave Duffield, are earnest in their vision to bring positive changes to the world of enterprise software. But let's get down to brass tacks and answer this question:
Is 15 hours of downtime acceptable? For Chiquita's Singh, it was tolerable.
"First, we lost the ability to process HR transactions during the normal course of that day's business. Second, and more significantly, we were preparing to go live with our Costa Rica implementation, so this outage had the potential to delay our schedule. However, we worked around it and went live as planned."
But Singh isn't using Workday's financial applications. It seems to me a 15-hour outage could affect payments going out, payments coming in, payroll and other important financial processes. Workday describes its cash management SaaS here:
Cash Management automates the coordination and control of cash-flow activity, automates administrative and control activities such as bank statement reconciliation, and provides business intelligence.
Is it acceptable to lose that cash management for 15 hours?
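For a rough sense of scale, here is a small sketch that converts 15 hours into an annual availability figure. It assumes, hypothetically, that this was the only outage in a 365-day year. The function names and the "nines" targets are illustrative and are not taken from any Workday service agreement.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return the percentage of the period during which the service was up."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Return how many hours of downtime a given availability target permits."""
    return period_hours * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # Assumption: the 15-hour outage is the only downtime in the year.
    print(f"Availability with 15 hours down: {availability_percent(15):.2f}%")
    # Illustrative targets only, not figures from any vendor contract.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows {hours:.2f} hours of downtime per year")
```

Under that assumption, 15 hours works out to roughly 99.83 percent uptime for the year. That is inside a 99 percent target (87.6 hours allowed) but well outside a 99.9 percent target (8.76 hours allowed).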
This is Bhusri's explanation of the outage on his blog:
Yesterday, the network attached storage (NAS) device that stores operating system files for our production servers detected a corrupted node within a backup RAID array. Rather than simply log the error, which is what it is supposed to do, the NAS took itself off-line. It is ironic that the redundant backup to a system with built-in redundancy caused the failure.
This type of error should not have caused the array to go offline, but it did. The most important result is that our failover plans worked as expected. Within hours, all customers were live in our secondary datacenter with all their data intact.
We've tested our failover plans many times, but this is the first time we did it for real. We've learned quite a bit in the process - some of it technical, some of it regarding communications with customers. That knowledge will be used to further refine our datacenter practices, our hardware choices, and our failover plans so that we can do even better in the future.
We all know that companies that run their own HR and financial apps can also have service outages. But this 15-hour outage raises some interesting questions about how CIOs will feel when the problem is the vendor's, and not theirs. Some may pull out their hair over their lack of control over the matter and start looking for an exit strategy. Others may feel some relief that they don't have to deal with it.
In fact, this is what Dave Duffield told blogger Vinnie Mirchandani:
"Unbelievably, I got emails from couple of our customers basically saying, 'Better you than me.' They are so glad they are not being woken up middle of the night. That's our job now."
Interestingly, this is Mirchandani's take on it: "If on-premise vendors were not concerned about SaaS and cloud vendors, this episode should be a loud wake-up call." Once again, a software consultant is giving props to Workday for handling the situation well. And it's easy to like Workday; it's the little guy trying to make a difference. What if this happened at Oracle, SAP, Salesforce.com, or heaven forbid, Google? The press and blogosphere would be on it like white on rice.
So no matter who the vendor is, the question still lingers in my mind: Is 15 hours an acceptable amount of time for your ERP to be down, especially when it's someone else who's handling it, and not your internal IT team?
Would love to hear others' thoughts on this matter.