Traffic in Amazon Web Services' most heavily used data center complex, U.S. East-1 in Northern Virginia, was tied up by an outage in one of its availability zones Monday morning. Damage control got underway immediately, but the effects of the outage were felt throughout the day.
Customers were affected shortly after noon Eastern Time, when they were unable to access Amazon's Elastic Beanstalk scaling service and its Elastic Block Store (EBS) service, which holds frequently accessed data used by hosted applications such as Salesforce.com's Heroku cloud platform, Pinterest, and news aggregator Reddit. Netflix, GitHub, Minecraft, Airbnb, Fast Company, and Foursquare also reported that they had been affected.
"We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance," Amazon's Service Health Dashboard reported at 11:26 a.m. Monday.
Other services, such as Amazon's Relational Database Service, depend heavily on EBS.
Teacher forum and education site Edmodo.com noted in a Twitter posting at 2:20 p.m. that its servers were unavailable: "Update: The site is still down. This is a server issue related to Amazon and we will update as soon as we have more info."
Sites that operate on a strict budget often take advantage of the minimal infrastructure costs associated with Amazon cloud services and operate in only one availability zone. But an outage in one zone can sometimes affect the availability of some services in others, as seen in the Easter weekend outage in April 2011.
Savvy customers such as Netflix, which have invested heavily in Amazon's EC2, can sometimes avoid service interruptions by using multiple zones. But as reported by NBC News, some Netflix regional services were affected by Monday's outage.
The outage started as a slowdown in response times and an increase in error rates in the Elastic Block Store service in one availability zone. The site hosts five different zones, or virtual data centers, each with its own independent telecommunications, power, and backup power. Some customers keep recovery copies of their systems in a second zone to provide a failover mechanism if one availability zone goes down.
Okta, an Amazon EC2-based identity management service, uses all five zones to hedge against outages. "If there's a sixth zone tomorrow, you can bet we'll be in it within a few days. We make use of every possible zone. We need to be up at all times," said Adam D'Amico, Okta's director of technical operations. Netflix service architect Adrian Cockcroft and others have advocated in public forums that customers use more than one zone for their own protection.
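The multi-zone approach described above can be illustrated with a minimal sketch. It assumes hypothetical zone names and a hand-built health map rather than any real AWS API, and it ignores the case noted earlier in which trouble in one zone spills over into others. It only shows the basic idea: spread copies of a system across zones, then route to a copy in a zone that is still healthy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical zone names for illustration; they are not taken from the article.
ZONES = ["zone-a", "zone-b", "zone-c", "zone-d", "zone-e"]


@dataclass
class Replica:
    name: str
    zone: str


def spread_replicas(count: int, zones: List[str]) -> List[Replica]:
    """Assign replicas to zones round-robin so no single zone holds them all."""
    return [Replica(f"app-{i}", zones[i % len(zones)]) for i in range(count)]


def pick_replica(
    replicas: List[Replica], zone_healthy: Dict[str, bool]
) -> Optional[Replica]:
    """Return the first replica in a healthy zone, or None if every zone is down."""
    for replica in replicas:
        # A zone missing from the health map is treated as unhealthy.
        if zone_healthy.get(replica.zone, False):
            return replica
    return None


replicas = spread_replicas(5, ZONES)
health = {zone: True for zone in ZONES}
health["zone-a"] = False  # simulate an outage in a single zone
chosen = pick_replica(replicas, health)
```

A site running in only one zone has nothing to fall back on in this sketch: if its single zone is marked unhealthy, `pick_replica` returns `None`.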
The trouble for Amazon persisted through the day. At 9:30 p.m. Eastern, its Service Health Dashboard reported, "We are seeing elevated errors rates on APIs related to describing and associating EIP addresses. We are working to resolve these errors. In addition, ELB is experiencing elevated latencies recovering affected load balancers and making changes to existing load balancers. These delays… will improve when that issue is resolved."
At 10:36 p.m. Eastern, it added, "…we expect ELB to recover more quickly now." Most problems were cleared up by 1:30 a.m. Tuesday.