This phase emphasizes exhaustiveness, now applied to ensuring that each service level covers the full scope of the provider's responsibility. For example, since the provider's service includes its network connection and infrastructure, parameters such as service availability and response time should reflect those components end to end, rather than relying on measurements from a monitoring server on the same LAN segment. Another important objective when writing the SLA is to avoid future disagreements by making the service levels as quantitative and unambiguous as possible.
At a minimum, every SLA should include the following:
>> Detailed description of the service level. This includes points of demarcation, triggers to initiate and terminate measurement, and criteria for success and failure. Define terms that otherwise might be open to interpretation: For example, is a degraded service still considered available? When is a problem classified as high priority?
>> Explanation of the data-collection process. Minimize ambiguity by describing data sources and data fields, collection times and frequency, and responsibility for data-collection activities. Decide if you'll use data collected by the provider, establish an internal monitoring capability, or use third-party cloud monitoring services, such as Cloudkick, Monitis, or Gomez.
>> Outline of the performance calculation. This can have a marked effect on service-level effectiveness. Consider how different calculation approaches will drive incentives and behavior. If resolution effectiveness is measured as mean time to repair, a provider with a large number of quick fixes can get away with having a single, extraordinarily long issue. Conversely, requiring that 95% of incidents be resolved within four hours provides an incentive for the provider to deprioritize resolution activity as soon as any incident enters the fifth hour. One answer is to include both "mean" and "maximum" resolution as distinct metrics within the SLA. Alternatively, you can develop compound service levels--for example, 95% of incidents resolved in four hours, 100% of incidents resolved in one business day.
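The incentive gap between "mean" and percentile metrics can be made concrete with a small calculation. The sketch below uses a hypothetical incident history (nine one-hour fixes and one 48-hour outage) to show how each calculation approach judges the same month of service; the numbers and thresholds are illustrative, not drawn from any real agreement.

```python
# Hypothetical incident resolution times, in hours: nine quick fixes, one long outage.
resolutions_hours = [1, 1, 1, 1, 1, 1, 1, 1, 1, 48]

# Mean time to repair: the single 48-hour incident is diluted by the quick fixes.
mttr = sum(resolutions_hours) / len(resolutions_hours)  # 5.7 hours

# Percentile target: share of incidents resolved within four hours.
# 90% pass, but the metric says nothing about how long the worst incident ran.
within_4h = sum(1 for r in resolutions_hours if r <= 4) / len(resolutions_hours)  # 0.9

# Compound service level from the text: 95% of incidents resolved in four hours
# AND 100% resolved within one business day (taken here as eight hours).
meets_compound = within_4h >= 0.95 and max(resolutions_hours) <= 8  # False
```

Here the compound test fails on both conditions, while MTTR alone might look tolerable, which is exactly the distortion the paragraph above warns about.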
Step 3: Set Realistic Performance Targets
Establishing the required level of service performance is another challenge. Set the threshold too low, and service will not meet your expectations; set it too high (as most IT teams tend to do) and you'll likely incur additional costs or miss opportunities to obtain concessions, such as tighter SLA exclusions, reduced credit caps, and other contractual terms. For cloud services, performance negotiations are further complicated by the provider's limited ability to offer differentiated delivery and support models. This means customers generally "get what they get," and any incremental tightening of service levels is reflected in increased service costs to offset anticipated losses from occasional SLA failure.
You have two main options to determine the performance needs of the business. If your company has historical data on its own performance, you can use that as a baseline for requested performance, adjusted to reflect business feedback on that performance and on current requirements. Alternatively, you can apply your stated performance measurement and calculation techniques to identify the point at which a performance drop-off starts hurting the business. If neither is possible, you may need to research the performance commitments available in the market for similar services through vendor information, account reps, and colleagues or user forums.
Be wary if providers request that performance exceeding one or more targets be used to offset shortfalls elsewhere. This might seem fair on first review, but it can distort the SLA model. If any service levels are easy to meet consistently (as is frequently the case), the provider effectively gains a free pass for service-level violations. What's more, there's often little benefit to the cloud customer for the provider to exceed stated performance targets. If the business doesn't need a service instance to be provisioned in less than one minute, why encourage the provider to speed up that service?
Step 4: Define Remedies For Failure
SLA credits are always one of the most heavily negotiated areas in an agreement. The credit structure is usually developed in two stages. First is settling on the total "fees at risk"--that is, the maximum compensation in service-level credits that the provider would be required to provide. Next is the less contentious allocation of those fees to individual service levels.
Conventional SLAs generally end up with around 10% to 15% of fees at risk, with a 200% to 250% multiplier. The effect of the multiplier is to increase the sensitivity of the agreement to individual performance failures, while capping the credit amounts payable for performance failure. A 15% cap with a 200% multiplier allows the customer to allocate the equivalent of 30% of the fees to the individual service levels. A variation of this approach is a system of points where (in this example) 200 points would be allocated across the individual service levels, with credits calibrated such that a total of 100 points would entitle the customer to a 15% credit payment.
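The cap-and-multiplier arithmetic above can be sketched in a few lines. The fee amount below is hypothetical; the 15% cap, 200% multiplier, and 200/100 point calibration are the example figures from the text.

```python
# Hypothetical monthly invoice; cap and multiplier from the article's example.
monthly_fees = 100_000
fees_at_risk_pct = 0.15   # 15% of fees at risk (the cap on total credits)
multiplier = 2.0          # 200% multiplier

# The multiplier lets the customer allocate more weight across individual
# service levels than the cap alone would allow: 15% x 200% = 30%.
allocatable_pct = fees_at_risk_pct * multiplier

# Points variation: 200 points spread across service levels,
# calibrated so 100 accrued points entitle the customer to the full 15% credit.
POINTS_FOR_FULL_CREDIT = 100

def credit_for(points_accrued: int) -> float:
    """Credit owed for failed service levels, capped at the fees at risk."""
    raw = (points_accrued / POINTS_FOR_FULL_CREDIT) * fees_at_risk_pct * monthly_fees
    cap = fees_at_risk_pct * monthly_fees
    return min(raw, cap)

# 50 points of failures pay half the cap; 150 points are still capped at 15%.
```

Note how the cap binds: once accrued points exceed 100, further failures cost the provider nothing more in credits, which is one reason to preserve the non-monetary remedies discussed next.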
Compared with conventional outsourcing SLAs, expect to settle for considerably less in cloud agreements, given the volume-based delivery model and thin margins. Instead, focus on non-monetary resolutions to SLA failures. Resist contract language that limits further remedies, such as waivers of consequential damages or of the option to terminate the agreement without liability. Retaining the right to pursue additional damages--in the case of gross negligence on the part of the provider, for example--is one reason you don't want to sign any service agreement that states, "Performance credits are the sole and exclusive remedy for performance failures."
Jonathan Shaw is a principal at Pace Harmon, an outsourcing advisory firm. Write to us at [email protected]