The Feb. 28 outage of Amazon's Simple Storage Service says one thing loudly and clearly: Amazon Web Services is growing so fast that it must rely on automated systems to keep operating. Things need to go awry in only a minor way, and the sheer scale involved puts those automated systems at risk of disabling the service itself.
That conclusion is implicit in Amazon's March 2 explanation of the event. Nothing short of rapid growth seems an adequate explanation for an S3 freeze-up that left Slack, Quora, Mailchimp, Giphy and many other major Web services frozen in time for four hours.
The AWS operations team said that an employee's data entry error, made while correcting a bug in its storage billing system, led to a cloud service outage in one region of its Ashburn, Va., data center complex, U.S. East-1.
The entry took more servers offline than intended. The result, according to Amazon's explanation: "The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests."
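The dependency Amazon describes can be made concrete with a minimal sketch. The class and method names below are hypothetical, not Amazon's code; the point is that when every request type must first consult the index subsystem for an object's metadata, taking the index's servers offline freezes GETs and PUTs alike.

```python
class IndexUnavailableError(Exception):
    """Raised when no index servers remain to answer lookups."""

class IndexSubsystem:
    """Toy stand-in for the subsystem that maps object keys to locations."""
    def __init__(self, servers):
        self.servers = set(servers)
        self.metadata = {}  # object key -> location info

    def remove_servers(self, to_remove):
        # The fat-fingered command: takes index capacity offline.
        self.servers -= set(to_remove)

    def lookup(self, key):
        if not self.servers:
            raise IndexUnavailableError("index subsystem offline")
        return self.metadata.get(key)

class S3Region:
    """A region's request front end; every operation routes through the index."""
    def __init__(self, index):
        self.index = index

    def get(self, key):
        # Reads need the index to locate the object.
        return self.index.lookup(key)

    def put(self, key, location):
        # Even writes must consult the index first, so an index
        # outage freezes every request type at once.
        self.index.lookup(key)
        self.index.metadata[key] = location
```

In this toy model, once `remove_servers` empties the index fleet, no request of any kind can be served, which is the behavior customers saw for four hours.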
Want to see the consequences of the Easter outage in 2011? See Cloud Takes A Hit: Amazon Must Fix EC2.
I would credit Amazon's explanation with a forthright description of the issue and a straightforward accounting of the impact. But it falls a little short of describing the total consequences. A region is a data center complex of multiple availability zones in a designated geography. U.S. East-1 is Amazon's biggest and most popular cloud center, with at least five availability zones. If the S3 objects in this region suddenly disappear from view, then every system in the cloud that needs them is going to hit the replicate button. Replication of complete data sets ties up resources for a long time, and too many systems trying to replicate at once is what AWS dubbed in 2011 a "remirroring storm." That event caused its April 2011 Easter weekend outage.
The post mortem report made no mention of a remirroring storm. But something similar appears to have happened with its automated systems in February 2017, almost six years after we first learned how the cloud was prone to shooting itself in the foot this way.
Given that the outage started with a data entry error, much of the reporting on the incident has described the event as simple human error. But the error involved was so predictable and common that human error is an inadequate description of what went wrong. It took only a minor slip to trigger AWS' operational systems into working against themselves. It's the runaway, automated nature of the failure that's unsettling. Automated systems operating in an inevitably self-defeating manner are the mark of an immature architecture.
To be sure, AWS S3 is an extremely reliable service. It went through all of 2016 without a single outage, as Gartner cloud analyst Lydia Leong has observed. Google's Gmail and Microsoft's Azure have suffered their own failures in a similar vein. But Amazon is significantly bigger than other cloud providers. It has a significantly greater number of enterprise customers depending on it. And it has the resources to address this prospect before trouble occurs.
I found Amazon's explanation credible, but still dismaying. Both the index subsystem and a second subsystem involved in the outage, the placement subsystem, are designed to sustain operations even with the removal of some capacity. Removing servers from operation is one of the fundamental management tasks of Amazon cloud operations.
But AWS' post mortem explanation goes on to say: "While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
So, if something unexpected happens and a service fails, its restart is a crapshoot. That's because Amazon, preoccupied with growing as fast as it has, has added an enormous amount of capacity since the last full restart. It must now attempt something on a scale that's never been planned for or tested before. Not surprisingly, in the event, the restart and the safety checks on the metadata took longer than anyone expected.
I appreciate the candor. I can see Amazon is willing to own up to the problem and take action to correct it. The post mortem goes on to say that it has revised its capacity removal tool so that there's a check on the total amount of capacity that can be disabled at one time. But I'm surprised that that's being done at this point -- 11 years after the launch of the all-important S3 service.
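The safeguard Amazon describes adding is a rate-of-change guard rail on its capacity-removal tool. A hedged sketch of the idea, with illustrative names and thresholds that are my assumptions rather than AWS's actual implementation:

```python
# Illustrative limits -- not AWS's real values.
MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a fleet per command
MIN_REMAINING = 3             # never shrink a subsystem below a viable minimum

def remove_capacity(active_servers: set, requested: set) -> set:
    """Take servers offline, refusing any single command that removes
    too much capacity at once or leaves too small a fleet behind.
    Returns the set of servers actually removed."""
    requested = requested & active_servers
    if len(requested) > len(active_servers) * MAX_REMOVAL_FRACTION:
        raise ValueError("refusing: request exceeds per-command removal cap")
    if len(active_servers) - len(requested) < MIN_REMAINING:
        raise ValueError("refusing: fleet would fall below safe minimum")
    active_servers -= requested
    return requested
```

With a check like this in the tool itself, the Feb. 28 data entry error would have been rejected at the command line instead of cascading into a regional outage.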
Amazon has another safeguard against failure: segmenting big services into smaller units so trouble can be pinpointed more quickly. "One of the most important (trouble-shooting approaches) involves breaking services into small partitions, which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem," the explanation continued.
"As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected," it said.
In other words, rapid growth resulted in very large cells that were slow to test and restart. Amazon is "reprioritizing" its plan to break S3 down into smaller, more manageable cells, moving it from "later this year" to an immediate objective.
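The cell idea Amazon describes can be sketched in a few lines. This is a toy model under my own assumptions, not S3's architecture: keys hash deterministically to one of many small cells, so when one cell fails, the blast radius is only that cell's keys, and each cell is small enough to restart and test on its own.

```python
import hashlib

class Cell:
    """One small, independently restartable partition of the service."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.store = {}

class CellularService:
    """Routes each key to a fixed cell; a cell failure only affects its keys."""
    def __init__(self, num_cells):
        self.cells = [Cell(f"cell-{i}") for i in range(num_cells)]

    def _cell_for(self, key):
        # Stable hash so a given key always lands in the same cell.
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.cells[h % len(self.cells)]

    def get(self, key):
        cell = self._cell_for(key)
        if not cell.healthy:
            # Only this cell's keys fail; the rest of the region serves on.
            raise RuntimeError(f"{cell.name} is down")
        return cell.store.get(key)

    def put(self, key, value):
        cell = self._cell_for(key)
        if not cell.healthy:
            raise RuntimeError(f"{cell.name} is down")
        cell.store[key] = value
```

The design choice is the trade-off Amazon names: more, smaller cells mean a smaller blast radius and a faster per-cell restart, at the cost of more partitions to operate.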
All well and good. But somehow the air of profitability hangs over this whole mishap. AWS has become immensely profitable for its Amazon.com parent company as it has expanded at a pace that boggles the minds of its rivals. Perhaps it's time for some of those profits to pour back into resilient cloud design, service durability, and rapid service recovery. Troubleshooters should be more invested in imagining the inevitable, such as an erroneous data entry, and devising checkpoints that prevent the error from spreading. The future of the cloud may depend on it.