What's a cloud service-level agreement (SLA) worth? Sometimes, not much.
SLA language is often vague, reflecting a young industry where no one is certain just how porous provider guarantees may be. There are loopholes. For example, if you're notified in advance of downtime for regular maintenance, the time your service is unavailable won't count against the uptime guaranteed each year.
Google Compute Engine, Amazon Web Services, Microsoft Azure, and the HP Cloud cluster around a stated guarantee of 99.95% uptime, with penalties to follow if they fall under that mark in any given month or quarter and cause their customers a loss of service.
That would seem to be a high standard. After all, it allows only about four hours and 23 minutes of downtime in a given year (0.05% of 8,760 hours). But that's not the highest standard active among service providers. Dimension Data guarantees 99.99% uptime, and if they can do it, why not the bigger-name providers? That standard allows only 53 minutes of downtime a year.
Or possibly your contract goes in the opposite direction. NetSuite's most recent SLA document, from May 2013, guarantees 99.5% uptime, which could allow over 43 hours of downtime a year.
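To see where those downtime figures come from, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year and ignoring any exclusions such as planned maintenance; the function names are invented for this example and do not come from any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime an uptime guarantee permits in the period."""
    return period_hours * (100.0 - uptime_percent) / 100.0


def format_duration(hours):
    """Format a number of hours as 'Xh YYm', rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    return f"{h}h {m:02d}m"


# The three guarantee levels mentioned in the article.
for pct in (99.5, 99.95, 99.99):
    print(f"{pct}% uptime -> {format_duration(allowed_downtime_hours(pct))} per year")
```

Under these assumptions, 99.5% works out to 43 hours 48 minutes a year, 99.95% to about 4 hours 23 minutes, and 99.99% to about 53 minutes.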
Salesforce.com's contract states that Salesforce will "use commercially reasonable efforts to make the online Purchased Services available 24 hours a day, 7 days a week," whatever "commercially reasonable" means. The exceptions are planned downtime such as maintenance intervals and "events beyond our control" such as earthquakes, civil unrest, acts of terrorism, or war.
Salesforce's exceptions to its pledge of constant availability include "denial of service attack." Most service providers on the Web have built-in protections against denial-of-service attacks and don't consider them an event that justifies a service outage.
Amazon Web Services was once described by Gartner cloud analyst Lydia Leong as "the poster child for cloud IaaS, but the AWS SLA also has the dubious status of 'worst SLA of any major cloud IaaS provider.'"
AWS's current SLA (dated June 1, 2013) also uses "commercially reasonable efforts" as the standard by which it will achieve 99.95% uptime. Its definition of when it's down, or "unavailable," leans in favor of the provider. For example, a customer using only one availability zone within an Amazon regional data center isn't acting on Amazon's recommended best practice of having backup in a second availability zone. Each region has multiple availability zones that function autonomously from each other. For an outage to count against Amazon's SLA, more than one availability zone must be unavailable to the customer. So if a whole availability zone goes down and you're in it, you collect nothing in the way of Amazon SLA credits unless your workloads also occupied another zone, and it was also unavailable.
In addition, all workloads need to be down, not just some of them, for the SLA's credit repayment to kick in. And Amazon has all year to achieve that 99.95% uptime, even if it's down for four-and-a-half hours in a given month.
Even the definition of "available" leans in favor of AWS. For EC2 to be considered "unavailable," all of a customer's instances must be unavailable in EC2 in a given region, such as the US East data center complex in Ashburn, Va. In addition, for the storage attached to running instances to be considered unavailable, the attached volumes must be performing "zero read/write I/O, with pending I/O in the queue." In other words, if Elastic Block Store has slowed to a snail's pace, and your business is clearing three or four transactions an hour, that doesn't necessarily qualify as an outage meriting a service credit.
While HP's SLA for its public cloud is also set at 99.95% availability, that's for each month, not across a whole year. And its terms consider the loss of availability of one instance, not all instances in a given region, to be an outage. In the cloud, each service has its own SLA definitions, complicating the picture and giving providers more than one exception on an outage claim.
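The difference between those two definitions can be made concrete with a small Python sketch. It is a deliberately simplified model of the terms as described above, not the text of either provider's SLA; the function names, the dictionary of instance states, and the 30-day month are all assumptions made for illustration.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed month length)


def monthly_downtime_budget_minutes(uptime_percent):
    """Minutes of downtime a guarantee permits when measured over one 30-day month."""
    return MINUTES_PER_30_DAY_MONTH * (100.0 - uptime_percent) / 100.0


def outage_if_all_instances_down(instance_up):
    """Simplified 'region-wide' rule: counts only if every instance is down."""
    return len(instance_up) > 0 and not any(instance_up.values())


def outage_if_any_instance_down(instance_up):
    """Simplified 'per-instance' rule: counts if at least one instance is down."""
    return any(not up for up in instance_up.values())


# Hypothetical customer with three instances, one of which has failed.
instances = {"web-1": True, "web-2": False, "db-1": True}

print(outage_if_all_instances_down(instances))   # region-wide rule
print(outage_if_any_instance_down(instances))    # per-instance rule
print(monthly_downtime_budget_minutes(99.95))    # minutes allowed per month
```

With one of three instances down, the region-wide rule reports no outage while the per-instance rule does. Measured monthly, a 99.95% guarantee leaves roughly 21.6 minutes of downtime per 30-day month, far less slack than the four-plus hours available when the same percentage is measured across a year.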
Amazon, to its credit, has chosen to interpret its terms liberally, offering credits to customers for its April 2011 Easter outage even though servers were up and running during it. Customers, in some cases, had no way of accessing them or making use of the Elastic Block Store storage service, as a disconnected network set off a "remirroring storm" inside its US East facilities.
What's your cloud's uptime? What's in your SLA? Can your service provider do better? Read on to learn more about what to look for in your cloud contracts. Share additional advice with your peers using the comments section.
(Source: Kara Harms on Flickr)