First, don't be dismissive of the cloud. Business units will simply bypass IT if it doesn't provide guidance. Second, advise business leaders on cloud risks and risk-mitigation strategies. Third, when the decision to use a cloud service is made, establish realistic and balanced service-level agreements.
Establishing an SLA is just one aspect of protecting your organization. What's needed is a step-by-step assessment-to-implementation process, helping business managers balance risk, fiscal impact, and flexibility. If the decision for a given IT service goes in favor of a cloud approach--software as a service, platform as a service, infrastructure as a service--you need to figure out how to proceed and what to do if and when things go wrong. That's the core of not only an SLA, but of good governance.
Getting the answers to the following questions will help you determine if cloud computing makes sense, evaluate providers, determine if you need an SLA, and, if so, how to craft a strong agreement.
1. What's the use case? Why are cloud services being considered?
Say you need to quickly develop a specialized, temporary application for a boat dealership, one that scales up and then, just as quickly, gets decommissioned. That's a great use case for cloud computing. There's little capital expense needed and, with the right service provider, the app can be deployed effectively. But not every use case is appropriate for cloud computing's current level of maturity. Something that can be quickly accomplished using your internal IT infrastructure might not make sense to push out to the cloud.
2. What are the risks and benefits?
If you're going to live in the clouds, bring a parachute. In addition to evaluating the benefits, be realistic about worst-case scenarios, such as an unavailable application, degraded performance, or a security breach.
It's useful to classify these scenarios by their relative probability, as well as by their impact on the business. For example, an enterprise application that many are reluctant to provision into a cloud environment is ERP, where there's a high impact of failure regardless of the probability of failure, which is never zero.
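One way to make that classification concrete is a small scoring sketch like the one below. The thresholds, the 1-to-5 impact scale, and the category labels are illustrative assumptions, not an industry standard; the point is only that a high-impact scenario gets flagged regardless of how unlikely it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float  # estimated chance per year, 0.0 to 1.0 (your own estimate)
    impact: int         # business impact, 1 (minor) to 5 (severe); assumed scale


def classify(scenario, high_impact=4, high_probability=0.10):
    """Bucket a worst-case scenario. Thresholds are hypothetical defaults."""
    if scenario.impact >= high_impact:
        # High impact dominates: probability is never zero.
        return "critical"
    if scenario.probability >= high_probability:
        return "manage"
    return "accept"


scenarios = [
    Scenario("ERP unavailable", probability=0.01, impact=5),
    Scenario("Test lab outage", probability=0.20, impact=1),
    Scenario("Brochure site slow", probability=0.05, impact=2),
]
labels = {s.name: classify(s) for s in scenarios}
```

Scenarios that land in the "critical" bucket are the ones to scrutinize hardest before committing to a cloud provider.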
3. Is a negotiated SLA needed? If so, what penalties are appropriate?
Think about why SLAs are necessary. It's because you're protecting something that someone else manages, and you want your service provider to have skin in the game. You know that the provider faces higher costs to offer availability guarantees, costs that could show up in the vendor's bottom line. They've got to account for that somehow, so as you turn the dial up on SLA penalties, the cloud service is going to get more expensive.
The point is that there's a natural tension between low-cost cloud computing and high SLA penalties. The risk premium that any provider must add to its business model--and thus to your pricing--is in direct proportion to financial penalties that the provider would be forced to pay out in the event of an SLA violation. There's a similar natural tension from the provider's perspective between high-availability and low-cost operations.
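To see the proportionality, consider a rough expected-value sketch. It assumes the provider simply adds its expected penalty payout to a base price; real pricing models are more involved, and every number here is hypothetical.

```python
def risk_premium(violation_probability, penalty_per_violation):
    """Expected payout per billing period under a simple expected-value model."""
    return violation_probability * penalty_per_violation


def price_with_sla(base_price, violation_probability, penalty_per_violation):
    """Base price plus the premium the provider must recover (assumed model)."""
    return base_price + risk_premium(violation_probability, penalty_per_violation)


# Hypothetical figures: $1,000/month base price, 5% chance of a violation per month.
modest = price_with_sla(1000.0, 0.05, 2000.0)   # penalty of $2,000
steep = price_with_sla(1000.0, 0.05, 4000.0)    # penalty doubled to $4,000
```

Under these assumptions, doubling the penalty doubles the premium (from about $100 to about $200 a month), which is the dial-turning effect described above.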
SLAs are all about recourse and what you can do to protect yourself when bad things happen with your cloud service provider, but SLAs aren't the only recourse. Switching service providers or using internal resources are other possibilities. So if your app dev team is using the cloud to build a testing lab quickly, and that testing could easily be rescheduled in the event of an outage, you probably don't need a tightly negotiated SLA. Many providers offer a baseline SLA with service credits, which might be fine for this application. But if a service failure would harm your business significantly, your use case probably isn't a good match for today's cloud offerings.
4. What metrics are important to our risk profile?
Key metrics in the cloud include availability, response time, and customer service response time (that is, how long you wait after you report a problem). Including provisions for important things like security may make sense, but that depends on the use case. If you're doing infrastructure as a service, you're never going to be able to get a security provision; too much is outside the provider's control, since you will be the one patching and otherwise administering your virtual servers. With a software-as-a-service vendor, you're more likely to get such a provision.
While uptime and response time matter, the most important metrics for judging the quality of service depend on your use case. If you're moving to cloud computing for its ability to scale quickly, ask the provider how it measures that ability. This isn't your dad's set of metrics. If it's important to your use case that geographically distributed servers provide superior service to a nationwide audience, you'll want to measure metrics in a sample of those regions.
Think about what might cause metric logging to be suspended, and who's responsible for external events. How will you take your application offline for maintenance? What if a malicious third party launches a distributed denial-of-service attack?
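How those exclusions are written changes the availability figure itself. The sketch below assumes a common convention in which excluded time (scheduled maintenance, for instance) is removed from both the downtime and the measurement period; your agreement may define it differently, and the figures are made up.

```python
def availability(period_minutes, outages):
    """Compute availability as a fraction between 0 and 1.

    outages: iterable of (minutes, excluded) pairs, where excluded=True
    marks downtime the SLA does not count (assumed convention).
    """
    outages = list(outages)
    excluded = sum(minutes for minutes, is_excluded in outages if is_excluded)
    counted = sum(minutes for minutes, is_excluded in outages if not is_excluded)
    measured = period_minutes - excluded
    if measured <= 0:
        raise ValueError("no measurable time left after exclusions")
    return 1 - counted / measured


# Hypothetical month: 10,000 measured minutes, a 1,000-minute maintenance
# window, and a 90-minute unplanned outage.
with_exclusion = availability(10_000, [(1_000, True), (90, False)])
without_exclusion = availability(10_000, [(1_000, False), (90, False)])
```

The same month scores about 99% with the maintenance window excluded and about 89% without it, so it pays to know exactly which events stop the clock.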
5. Is our environment the weakest link?
Midsize enterprise networks are sometimes less robust than they could be when it comes to Internet connectivity. That can be the result of a desire to minimize data ingress and egress points in order to lower security risks. If your application needs to phone home in order to work (if it communicates with a database housed within the enterprise network, for example) and the enterprise network encounters Internet connectivity problems, then the cloud app may also have connectivity problems. There are strategies to mitigate this scenario, of course. The point is, when you're building an SLA, you'll need to quickly identify any points of failure in your own environment in order to be a good citizen and maintain credibility with your service provider.
6. What have pilot tests and reference checks shown?
We don't care how great a vendor's proffered SLA looks; you must conduct due diligence on the service provider if the use case involves anything important. Even the vendors agree on this point. "You want to talk to two or three customers and vet the vendor," says Ian Knox, senior director of product management with Skytap, a cloud service provider. "If the vendor isn't growing, and you see negative things about the vendor on social networking sites, they won't be around for long."
7. Who will measure SLA metrics?
If you're going to turn over something important to an external service provider, you can generally count on getting honest metrics from it. That said, the most pragmatic approach is to get a third-party perspective. Using an application performance management provider such as Cloudkick, Gomez, enStratus, or Apparent Networks can illuminate problems even before the provider takes action.
Using third-party measurement has its own set of benefits and challenges. Service providers probably won't recognize a third party's measurements when it comes to invoking an SLA penalty. On the other hand, third-party monitoring can root out problems before your customers or partners notice an application slowdown, and it can serve as an early warning for you to check with the service provider.
8. From what perspective will SLA metrics be measured?
Think about the location from which they'll be measured. If your use case requires getting the application into geographically dispersed data centers, measuring from multiple perspectives will ensure that you're not just looking at the tail of the elephant.
And it's not just measuring from the outside; the inside can be vitally important. Your multitiered application might be behind the service provider's firewall, making it tough to measure database server response times versus application server response times. Knowing how these differ might be key in troubleshooting or in pinpointing the source of a problem for the service provider.
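A minimal sketch of comparing perspectives follows. The vantage-point names and response times are invented for illustration, and the median is just one reasonable summary statistic; the idea is simply to put outside and inside measurements side by side.

```python
from statistics import median


def summarize(samples):
    """Median response time (ms) per vantage point, skipping empty series."""
    return {point: median(times) for point, times in samples.items() if times}


def slowest(samples):
    """Name of the vantage point with the highest median response time."""
    summary = summarize(samples)
    return max(summary, key=summary.get)


# Hypothetical samples in milliseconds: two outside regions, two inside tiers.
samples = {
    "outside/us-east": [100, 120, 110],
    "outside/us-west": [300, 280, 320],
    "inside/app-server": [40, 45, 50],
    "inside/db-server": [15, 20, 25],
}
summary = summarize(samples)
worst = slowest(samples)
```

Here the western region stands out while both internal tiers look healthy, which would point the conversation with the provider toward network path or regional capacity rather than the database.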
9. What happens if the provider completely fails?
Somewhere in your planning, take into account what would happen if the service provider fails for an unacceptably long period. With a formal contract, this would typically mean ending the relationship. With a pay-as-you-go cloud service, you might not have much of a service contract.
Terminating the service must be part of your thought process. In any quasi-outsourced business model, the responsibility for switching providers or for picking up the pieces when one fails lies with your IT team. The folks in the business unit will blame you if things go south, so have a plan to get out quickly if you must.
Jonathan Feldman is an IT executive and consultant based in North Carolina.