Cloud service level agreements (SLAs) remain a sore spot among customers of cloud computing services. Not only are SLAs rife with ill-defined terms and vendors' self-serving phrases such as "at our discretion," but they also tend to use only one metric: service uptime.
And it is a somewhat lenient metric at that. The major service providers such as Google, Amazon, and HP guarantee 99.95% uptime, meaning the service can be unavailable for 4 hours and 23 minutes a year without incurring any penalty to the supplier.
"If I were an enterprise cloud user, I'm not sure that's the only thing I would be concerned about," said Sharon Wagner, CEO of Cloudyn, a third-party monitoring service. Wagner has invented a language for describing SLAs in a way that allows them to be automatically monitored and enforced by software systems -- and he's even seeking a patent on it. He said in an interview that customers should look at additional metrics such as predictable performance levels, consistent response times, and expandable service when you need it, which is called cloud elasticity. Most SLAs are silent on those points, and performance levels could vary widely in the course of a month or year without customers having any recourse.
In one respect, there's been a slight improvement in the otherwise weak, common cloud SLA. The HP Helion Cloud uses its SLA as a point of distinction from better established providers. HP points out that most cloud service providers average their SLA uptime percentage over a year. "At Helion Public Cloud, we consistently offer protection with an SLA of 99.95% monthly availability," says the company's SLA.
It's a small difference, but 99.95% a month allows just 22 minutes of downtime, because the metric is applied to each month separately rather than averaged over the course of a year. Amazon Web Services, which used to state the annual percentage, also adopted the monthly metric without fanfare sometime in 2014. Microsoft said its 99.95% uptime guarantee now applies on a monthly basis, though it had to supply a download link to a recent SLA document before any reference to the “monthly” application could be found.
Amazon's EC2, Microsoft Azure, and Google Compute Engine all use the 99.95% guarantee. Google, Microsoft, and Amazon have recently switched to the monthly application of it. For a major retailer using a cloud service to host its ecommerce systems, a 22-minute outage approaching the holidays in December has a lesser business impact than does a four-hour-and-23-minute one -- which would be allowable for 99.95% uptime over the course of a year. That shows how the precise wording of an SLA can make a big difference.
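The arithmetic behind those two downtime figures can be checked with a short sketch. The function name and the 365-day year and 30-day month are assumptions made for illustration; actual SLAs define their own measurement periods and exclusions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in the period before the SLA is breached."""
    return (1 - uptime_percent / 100) * period_days * MINUTES_PER_DAY


# Assumption: a 365-day year and a 30-day month.
annual = allowed_downtime_minutes(99.95, 365)
monthly = allowed_downtime_minutes(99.95, 30)

print(f"Annual budget:  {int(annual // 60)} h {round(annual % 60)} min")
print(f"Monthly budget: {monthly:.1f} min")
```

Under an annual measure the whole budget of roughly four hours and 23 minutes can be used up in a single incident, while a monthly measure caps any one month at a little under 22 minutes.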
The Rackspace SLA for cloud server hosts doesn't mention a percentage of uptime. Rather, it says if a host goes down, Rackspace will repair it within an hour. If it remains down over an hour, a penalty of 5% of the customer's server time per month is applied for each additional hour of outage. In other words, after 21 hours Rackspace owes the customer 100% of the month's bill for that server as repayment for workloads that were down. That would be a small sum compared to the business impact such an outage might have. Furthermore, even an outage that is repaired within the hour amounts to an SLA that is weaker than Amazon's: one hour of downtime in a 30-day month works out to roughly 99.86% uptime, before any penalty kicks in.
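The credit schedule described above can be sketched in a few lines. This is an illustration of the article's summary, not Rackspace's own calculation: the function name is hypothetical, and prorating partial hours is an assumption the article does not confirm.

```python
def credit_percent(outage_hours: float,
                   grace_hours: float = 1.0,
                   credit_per_hour: float = 5.0) -> float:
    """Percent of the month's server bill credited for a single outage.

    The first hour carries no penalty; each hour after that earns 5%,
    capped at 100% of the month's bill.
    Assumption: partial hours are prorated.
    """
    penalized_hours = max(0.0, outage_hours - grace_hours)
    return min(100.0, penalized_hours * credit_per_hour)


for hours in (1, 2, 11, 21, 48):
    print(f"{hours:>2} h outage -> {credit_percent(hours):.0f}% credit")
```

The cap is reached at 21 hours, so a two-day outage earns the same credit as a 21-hour one.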
Amazon's SLA is more straightforward than it used to be, but still retains phrasing such as: "Your sole and exclusive remedy for any unavailability, non-performance, or other failure by us to provide Amazon EC2 or Amazon EBS is the receipt of a Service Credit." In other words, don't expect any money to change hands, even if there has been damage to the business. Penalties are paid out in grants of free time on EC2 servers.
Providers try to couch their limited guarantees in self-protective language, and what should be covered by the agreement often doesn't warrant a mention, according to knowledgeable observers. When a reader asked on the public forum Quora in 2010 whether Amazon had a cloud SLA, Jason Read, co-founder of the third-party monitoring service CloudHarmony, responded: "SLAs don't really mean much. The typical financial compensation offered for not meeting SLAs is close to nothing."
Henrik Schinzel, CTO and co-founder of Avail Intelligence in Malmo, Sweden, responded: "I have read a bunch of SLAs and dissected them, and come to the conclusion that most of them serve two purposes: 1) Create a false sense of security for the customer; 2) Provide the companies with a bunch of legal loopholes."
Don't look at the SLA; look at the company's track record, he advised.
Some SLA language has become less self-excusing for vendors over the past four years, but the penalties remain the same in all cases: time credits instead of cash payments.
And performance is an area that still doesn't even get addressed. A survey by the application performance management company Compuware last year found that 79% of cloud users found their SLAs "too simplistic," and 73% believed cloud providers were hiding infrastructure problems that affect workload performance.
That's also a clear worry of Cloudyn CEO Wagner: "Workload performance metrics are not represented at all."
"My concern is, if you do experience a failure, how fast will you recover? And will you be able to restore my data?" he added. A metric in the SLA that would govern frequency of failures would be a stated mean time between failures, which could be accompanied by stated mean time to recovery.
Another metric he recommends is stating the number of technical support trouble tickets that can be escalated by the customer, since limits frequently exist in the mind of the provider.
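The mean time between failures and mean time to recovery that Wagner mentions can be tied back to an uptime percentage. The sketch below uses the standard steady-state availability relation, MTBF / (MTBF + MTTR), which is not taken from any provider's SLA; the function names and the one-hour recovery figure are hypothetical.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability implied by a stated MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


def required_mtbf_hours(target_percent: float, mttr_hours: float) -> float:
    """MTBF needed to reach a target availability at a given MTTR."""
    a = target_percent / 100
    if not 0 <= a < 1:
        raise ValueError("target_percent must be in [0, 100)")
    return mttr_hours * a / (1 - a)


# Hypothetical figure: every failure is repaired in one hour.
mtbf = required_mtbf_hours(99.95, mttr_hours=1.0)
print(f"MTBF needed for 99.95% at a 1 h MTTR: about {mtbf:.0f} hours")
print(f"Check: {availability_percent(mtbf, 1.0):.2f}% availability")
```

Stating both numbers tells a customer more than the percentage alone: the same 99.95% could mean rare long outages or frequent short ones.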
Additional metrics will start appearing in cloud SLAs as competition continues to heat up, he predicted. "More metrics will be added so enterprises will trust service providers." He favors more measures and more visibility into cloud operations over more penalties, until the industry matures further.
Even with these metrics, customers will need a way to monitor and manage their providers. Those who have rushed into cloud computing as a way to expand their data center resources quickly must depend on stats from the suppliers themselves.
Larger customers are already inserting some of the metrics he advocates in negotiations over their support contracts, Wagner said. But improved SLA measures should not be reserved to the largest customers. Standard SLA metrics need to become the order of the day for all users along with ways to ensure they are enforced, he said.