Amazon Web Services is working hard to patch its EC2 cloud host servers for what appears to be a severe hypervisor security defect. The process started at 7:00 p.m. Pacific time Thursday. It needs to be completed in the next five days -- before Oct. 1.
During that time Amazon must shut down, patch, and reboot 10% of EC2's servers. Amazon Web Services spokesman Jeff Barr confirmed in a blog post Thursday that the defect is in the Xen hypervisor.
Amazon specifically discounted any suggestion that the update and reboot had anything to do with the Bash bug, discovered Tuesday, which poses another threat to many systems around the Internet.
Amazon chose Xen in 2006 to run its Amazon Machine Image virtual machines. The supplier of Xen, the XenSource unit of Citrix, has posted a notice on Citrix's security updates page that it will air both a major vulnerability known as XSA-108 and its patch on Oct. 1. No details on the vulnerability have been published. That's to give major Xen users, such as Amazon, time to apply the patch before it becomes public. On Sept. 24, Amazon began notifying customers that the patching and rebooting would occur over the next few days.
"These updates must be completed by Oct. 1 before the issue is made public as part of an upcoming Xen Security Announcement (XSA)," Barr wrote.
"As we explained in emails to the small percentage of our customers who are affected… the instances that need the update require a system restart of the underlying hardware and will be unavailable for a few minutes while the patches are being applied and the host is being rebooted," Barr explained.
Most operating system, hypervisor, and application updates can be applied without taking the whole server down. But "certain limited types of updates require a restart," continued Barr. Amazon will stagger the updates so that no two regions or availability zones within a region are down at the same time. That means that followers of Amazon's best practices who have a backup system ready to go in a second availability zone are likely to experience no outage. But not all customers use the two-pronged guarantee of availability. Those running a production system in one zone are likely to experience a brief downtime.
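A rough way to see which workloads fall into that single-zone group is to count the availability zones each service runs in. The sketch below is illustrative only: the `service` and `zone` fields, the service names, and the sample fleet are hypothetical and are not drawn from Amazon's tooling.

```python
from collections import defaultdict

def single_zone_services(instances):
    """Return the services whose instances all sit in one availability zone."""
    zones_by_service = defaultdict(set)
    for inst in instances:
        zones_by_service[inst["service"]].add(inst["zone"])
    # A service spread over fewer than two zones has no fallback during a staggered reboot.
    return sorted(name for name, zones in zones_by_service.items() if len(zones) < 2)

# Hypothetical inventory: "web" spans two zones, "billing" runs in only one.
fleet = [
    {"service": "web", "zone": "us-east-1a"},
    {"service": "web", "zone": "us-east-1b"},
    {"service": "billing", "zone": "us-east-1a"},
    {"service": "billing", "zone": "us-east-1a"},
]

at_risk = single_zone_services(fleet)
print(at_risk)
```

In this made-up fleet only the billing service would be flagged, since a reboot of its one zone leaves nothing running elsewhere.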
Each zone's hosts "will restart with all saved data and all automated configuration intact. Most customers should experience no significant issues with the reboots," Barr wrote.
"We understand that for a small subset of customers the reboot will be more inconvenient; we wouldn't inconvenience our customers if it wasn't important and time-critical to apply this update."
Exactly how long that "inconvenience" lasts may depend on the individual customer's situation.
"This update looks extremely critical," said Sebastian Stadil, CEO of Scalr, a supplier of a front-end cloud management platform for Amazon and other service suppliers. Judging by the urgency with which Amazon is going about its updates, it looks like a major hole in the Xen hypervisor, one that has few barriers to its use and is easy to exploit, he said.
If it is easy to exploit, once it's aired the "script kiddies will write scripts that cause lots of destruction just for fun," Stadil said in an interview.
Giving major users of the technology time to update before the vulnerability is made known is now a standard practice among researchers and security vendors, who often find the vulnerabilities. "It's considered a best practice," he said.
Stadil expressed puzzlement that Rackspace, another large user of Xen, hasn't revealed a similar update plan. Rackspace uses a version of Xen in its public cloud services; it relies on KVM in its managed cloud and private cloud operations. Rackspace couldn't be reached for comment late Thursday.
Another cloud management platform, RightScale, also gave notice to its users that the Amazon update was underway. CTO Thorsten Von Eicken wrote in a blog post yesterday, his second on the subject: "The fact that we're simultaneously 'blessed' with both this AWS event as well as the (likely unrelated) bash exploit adds to the overall emergency workload and stress" of customers.
But there are steps customers can take to try to ensure minimum impact on their cloud systems, he wrote. Round-the-clock services on Amazon with their own operations teams that have designed for fault tolerance "will pre-emptively take care of relaunching affected instances and will not have an issue."
They may be in the minority, however. Large-scale Amazon customers running thousands of virtual machines "will face a stiffer challenge because even manual triage of which instances are affected and which are not is impractical. The ops team may be scrambling to write scripts to automate some of these tasks."
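The kind of triage script Von Eicken describes might start out like the sketch below, which groups instances with a scheduled reboot by availability zone. It is a sketch, not anyone's actual tooling: the record layout (`InstanceId`, `AvailabilityZone`, `Events`, `Code`) is assumed to mirror what the EC2 DescribeInstanceStatus call returns, and the instance IDs and sample records are made up.

```python
from collections import defaultdict

# Scheduled-event codes assumed to indicate a reboot of the instance or its host.
REBOOT_CODES = {"instance-reboot", "system-reboot"}

def instances_needing_reboot(instance_statuses):
    """Group the IDs of instances with a scheduled reboot event by availability zone."""
    by_zone = defaultdict(list)
    for status in instance_statuses:
        events = status.get("Events", [])
        if any(event.get("Code") in REBOOT_CODES for event in events):
            by_zone[status["AvailabilityZone"]].append(status["InstanceId"])
    return dict(by_zone)

# Hypothetical status records standing in for a real API response.
sample = [
    {"InstanceId": "i-aaa111", "AvailabilityZone": "us-east-1a",
     "Events": [{"Code": "system-reboot"}]},
    {"InstanceId": "i-bbb222", "AvailabilityZone": "us-east-1b", "Events": []},
    {"InstanceId": "i-ccc333", "AvailabilityZone": "us-east-1a",
     "Events": [{"Code": "instance-reboot"}]},
]

affected = instances_needing_reboot(sample)
print(affected)
```

Sorting the result by zone gives an operations team a rough order in which to relaunch or fail over workloads ahead of each zone's maintenance window.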
Software-as-a-service vendors on Amazon with single-tenant instances of their applications "may have hundreds of small deployments, each of which may need some intervention," Von Eicken predicted. Amazon presents reports on consolidated accounts. Operations teams may have to sort through the accounts, looking for affected systems.
Relaunching workloads will cost smaller services time as they manually identify affected instances and relaunch them in advance of the downtime. "Some failures" are likely to occur and be visible to end users, he noted.
Those whose operations personnel are on vacation or non-existent will have to count on their servers to reboot on their own, as Amazon predicts, after a few minutes of downtime. That's possible, Von Eicken continued, but "the reality is such that I would expect a high failure rate anyway."
He also warned about "forgotten servers," buried in organizations with multiple accounts, some of which they've lost track of. The AWS email notifications in some cases are going to unread or deleted email inboxes. "I expect that almost every organization ends up with a few 'oops, forgot we had'… and some of these will turn out to cause user visible failures."
Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...