How does Okta ensure its service stays up and running on Amazon EC2? It takes an extreme view of Amazon's advice and replicates data in five availability zones.
One customer of Amazon's EC2 cloud has a bit of advice on maintaining high availability of your applications there. Follow Amazon's instructions and put your system in more than one availability zone (AZ) to protect against an outage.
As a matter of fact, Okta, a cloud-based user access and identity management service, initially used three availability zones. Its operations were based in Amazon's U.S. East facility in northern Virginia, Amazon's most heavily trafficked data center complex. When four zones became available, Okta adopted four, and more recently, a fifth.
"If there's a sixth zone tomorrow, you can bet we'll be in it within a few days. We make use of every possible zone. We need to be up at all times," said Adam D'Amico, Okta director of technical operations.
The stakes are high for Okta since its operations run entirely on top of Amazon Web Services (AWS) EC2. Any outage there results in Okta "directly passing it on to our customers," whose employees will not be able to access their applications. In short, D'Amico's advice on availability zones: "The more, the better."
That's what Amazon has been saying all along, sort of. When outages have occurred, some of its most experienced customers, such as Netflix and Zynga, stayed up and operational because they used more than one availability zone. An application failure in one availability zone leads to failover and renewed operations in a second.
In its March 27, 2008, announcement of the zones, Amazon said putting an application in a second zone "is as easy as changing a parameter in an API call." The benefit of doing so was that each zone was physically separate from the others, with its own power, cooling, and networking. The interruption of any of those services in one zone theoretically leaves them intact in others.
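To make the "changing a parameter" claim concrete, here is a minimal sketch in Python that builds one launch request per zone. The dictionary keys follow the shape of the EC2 RunInstances request, where `Placement.AvailabilityZone` selects the zone; the helper name `launch_requests`, the placeholder image ID, the instance type, and the zone names are assumptions for illustration only, and nothing here calls AWS or reflects Okta's actual setup.

```python
def launch_requests(image_id, instance_type, zones):
    """Build one launch request per zone (hypothetical helper, no network calls).

    Every request is identical except for the Placement.AvailabilityZone value.
    """
    return [
        {
            "ImageId": image_id,
            "InstanceType": instance_type,
            "MinCount": 1,
            "MaxCount": 1,
            "Placement": {"AvailabilityZone": zone},
        }
        for zone in zones
    ]


# Placeholder values, assumed for illustration.
requests = launch_requests("ami-EXAMPLE", "m1.small", ["us-east-1a", "us-east-1b"])
for request in requests:
    print(request["Placement"]["AvailabilityZone"])
```

The point of the sketch is only that the zone is a single field in an otherwise unchanged request; keeping data synchronized across zones is a separate problem.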
Customers who do so can get the high availability in the cloud that was previously limited to "only very large companies" that could afford to put duplicate systems in two different locations, Amazon's original announcement of the zones said.
Okta's D'Amico says the concept works, not always perfectly, but it still works.
"We make sure to replicate the data to all five zones," he said in an interview from his San Francisco office. The availability zone may be across the street or it may be in a building that is across town. Either way, he said, the connections between zones have been designed with low latencies, and synchronization of data can take place quickly across zones.
The use of more than two availability zones may be wise. The CEO of WhatsYourPrice.com, a dating website, said that despite his firm's use of two zones, its website was off the air for two hours on June 29, a Friday, when many customers needed it to follow up on dates they had already arranged. WhatsYourPrice.com left Amazon's EC2 after that second blackout of its website in two weeks and set up its own physical facilities.
D'Amico said leaving Amazon isn't really necessary, but establishing your systems in more than two zones is a good idea. "We deployed on AWS from the very beginning," making use of three zones, he said. Three would not have been enough when Amazon experienced its worst outage to date, April 19-21, 2011, known as the Easter weekend outage. The Okta service was knocked out of commission for seven minutes as its engineers troubleshot a software bug that was preventing system failover. By that time, Okta was using four availability zones. Once the bug was fixed, the service became available again in Okta's fourth zone; the other three were suffering service freeze-ups that would otherwise have knocked Okta off the air.
"Three out of four zones were impaired during that outage. We had expanded into all four, so we were in good shape," he recalled. Okta has since expanded into all five, as noted earlier.
Using multiple availability zones is a good strategy for high availability. True disaster recovery, however, requires that recoverable systems be placed in a different geographic location than, say, Ashburn, Va., where all five of Okta's primary zones are located. To accomplish that, Okta keeps backup systems in Amazon's U.S. West facilities, in California's Silicon Valley and Oregon. It does a dry run of failover to the West Coast every quarter to ensure that if the worst happens, and all five availability zones fail on the East Coast, it can recover and continue operating from the West Coast. So far, D'Amico thinks four is the lucky number.
AWS VP Adam Selipsky in an email message said customers may use availability zones, along with other measures, to engineer the degree of high availability that they desire. AWS Elastic IP addresses that are associated with an account, rather than a static location, aid quick movement of systems; AWS Elastic Load Balancing can detect an instance outage and route traffic to running instances in other zones.
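The routing behavior described here can be pictured with a small Python sketch: a round-robin balancer that skips instances marked unhealthy, so traffic lands on running instances in other zones. This is a toy model under stated assumptions, not how AWS Elastic Load Balancing is implemented; the class name `ZoneAwareBalancer`, its methods, and the instance and zone names are all hypothetical.

```python
class ZoneAwareBalancer:
    """Toy round-robin balancer that skips unhealthy instances (hypothetical)."""

    def __init__(self, instances):
        # instances maps an instance id to the zone it runs in.
        self.instances = dict(instances)
        self.healthy = set(self.instances)
        self._order = sorted(self.instances)
        self._next = 0

    def mark(self, instance_id, is_healthy):
        """Record the result of a health check for one instance."""
        if is_healthy:
            self.healthy.add(instance_id)
        else:
            self.healthy.discard(instance_id)

    def route(self):
        """Return (instance id, zone) of the next healthy instance."""
        for _ in range(len(self._order)):
            candidate = self._order[self._next % len(self._order)]
            self._next += 1
            if candidate in self.healthy:
                return candidate, self.instances[candidate]
        raise RuntimeError("no healthy instances in any zone")


balancer = ZoneAwareBalancer({"i-1": "zone-a", "i-2": "zone-b"})
balancer.mark("i-1", False)   # simulate a failed health check in zone-a
print(balancer.route())       # traffic goes to the instance in zone-b
```

In this model a request fails only when every instance in every zone is unhealthy, which is the situation the use of additional zones is meant to make less likely.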
Amazon provides guidance on building fault-tolerant applications in a white paper, but "fault tolerance" is a more bulletproof standard than just running in two availability zones, Selipsky concedes.
"Simply running some of the application's instances in multiple availability zones will not make an application fault tolerant. A fault tolerant application ... needs to be able to run simultaneously and independently in each AZ. This means if an application requires three servers to operate normally, it needs to be built on three servers in each AZ," he wrote. The three might be, for example, an application server, a database server, and a Web server.
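Selipsky's distinction can be checked mechanically. The Python sketch below compares a deployment that merely spreads servers across zones with one that keeps a complete set in each zone. The role names (`web`, `app`, `db`), the zone names, and both helper functions are assumptions made for illustration; they do not describe any real AWS or Okta configuration.

```python
# Roles the application needs in order to run (assumed for illustration).
REQUIRED_ROLES = {"web", "app", "db"}


def self_sufficient_zones(deployment):
    """Return the zones that host every required role on their own."""
    return sorted(
        zone for zone, roles in deployment.items()
        if REQUIRED_ROLES <= set(roles)
    )


def survives_loss_of(deployment, failed_zones):
    """True if at least one self-sufficient zone remains after the failures."""
    remaining = {
        zone: roles for zone, roles in deployment.items()
        if zone not in failed_zones
    }
    return bool(self_sufficient_zones(remaining))


# Servers spread across two zones, but neither zone can run alone.
spread_thin = {"zone-a": ["web", "app"], "zone-b": ["db"]}

# A complete set of servers in each zone.
full_copies = {"zone-a": ["web", "app", "db"], "zone-b": ["web", "app", "db"]}

print(survives_loss_of(spread_thin, {"zone-b"}))  # False: the database is gone
print(survives_loss_of(full_copies, {"zone-b"}))  # True: zone-a runs on its own
```

The first layout uses two zones yet fails when either one does; only the second meets the standard quoted above.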
"Overall, during the past six years, customers have told us that they have had very good operational performance on AWS, which is one of the primary reasons AWS has grown so rapidly," he continued. "That said, we will not be satisfied until our uptime record is statistically indistinguishable from perfect," he added.
Given the occurrence of three to four AWS outages over the last 18 months, the record still falls short of perfect. Nevertheless, the experience of Okta and other customers shows that extra investment in staying up in the cloud can bring that goal considerably closer.