Cloud Suppliers Quickly Patched Xen Bug
Amazon was fastest off the starting blocks to patch the Xen hypervisor bug; Rackspace and IBM SoftLayer soon followed.
Amazon Web Services, Rackspace, and IBM's SoftLayer have all rebooted a significant number of their cloud servers to patch a security flaw in the open source Xen hypervisor.
The bug was discovered by Jan Beulich, a software engineering consultant at Linux supplier SUSE, now part of Attachmate. Beulich is a graduate of Moscow State University and works in Attachmate's Cologne, Germany, unit of Novell/SUSE.
IBM SoftLayer began notifying affected customers of potential downtime on Sept. 28, and started patching the bug at 3 p.m. UTC (Coordinated Universal Time) on Wednesday, Oct. 1. Xen open source project leaders at SUSE followed through on their plans to make the exploit public at noon UTC Oct. 1. That meant there were three hours of public exposure of the bug before SoftLayer began patching.
SoftLayer also couldn't patch all of its exposed systems at the same time. "Eliminating the vulnerability requires updating software on host nodes, and that requires downtime for the virtual servers running on those nodes. Yeah, that's not something anyone likes to hear. But customer security is of the utmost importance to us, so not doing it was not an option," SoftLayer said in a blog post.
"We are updating host nodes data-center-by-data-center to complete the emergency maintenance as quickly as possible. This approach will minimize disruption for customers with failover infrastructure in multiple data centers," wrote SoftLayer.
To avoid the possibility of taking down customers' systems running in different data centers at the same time, SoftLayer completed patching in one data center before moving on to the next. According to posts in SoftLayer's customer forums, the patching process was complete by 10:19 a.m. UTC on Thursday, Oct. 2.
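The data-center-by-data-center approach SoftLayer describes can be sketched in a few lines of Python. This is only an illustration of the sequencing idea, not SoftLayer's tooling: the function `rolling_patch`, the `patch_host` callback, and the inventory names are all hypothetical.

```python
from typing import Callable, Dict, List


def rolling_patch(
    hosts_by_datacenter: Dict[str, List[str]],
    patch_host: Callable[[str], bool],
) -> List[str]:
    """Patch one data center at a time and stop at the first failure.

    Returns the names of the data centers that were fully patched.
    """
    completed = []
    for datacenter, hosts in hosts_by_datacenter.items():
        # Finish every host in this data center before touching the next,
        # so a customer's failover site is not rebooted at the same time.
        results = [patch_host(host) for host in hosts]
        if not all(results):
            break
        completed.append(datacenter)
    return completed


# Hypothetical inventory; the names are illustrative only.
inventory = {
    "dc-a": ["host-a1", "host-a2"],
    "dc-b": ["host-b1"],
}
patched_order: List[str] = []


def fake_patch(host: str) -> bool:
    """Stand-in for the real reboot-and-update step."""
    patched_order.append(host)
    return True


done = rolling_patch(inventory, fake_patch)
```

Stopping at the first failed data center is a design choice in this sketch: it keeps a bad update from spreading to sites that customers may be relying on for failover.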
Amazon announced its patching plans Sept. 24 and executed them Sept. 26-30, completing the job on schedule before the exposure became public Oct. 1. It also updated one availability zone within a region before proceeding to the next. When the task was done, AWS evangelist Jeff Barr explained, "We couldn't be as expansive as we'd have liked on why we had to take such fast action. The zone-by-zone reboots were completed as planned and we worked very closely with our customers to ensure that the reboots went smoothly."
Figure 1: (Image: Christine Westerback)
Rackspace realized a quarter of its 200,000 customers were affected and corrected the issue, taking running servers down over the weekend of Sept. 27 and 28 to repair them. It apologized to customers afterward for the lack of advance notification by emailing customers on Sept. 30 and posting the email by president and CEO Taylor Rhodes on its blog Oct. 1.
Rackspace OpenStack private cloud users were unaffected because their virtual machines are running under the KVM hypervisor. But the Rackspace Public Cloud includes many Xen users.
The nature of the bug, officially called Xen Security Advisory 108 or XSA-108, was obscure and it had not appeared in the wild prior to disclosure by SUSE's Beulich. At the same time, it would have been easy to exploit with limited coding skills and represented a potential for severe intrusion.
Advanced virtualization, unlike the earliest versions of the VMware ESX Server hypervisor, makes use of assists built into AMD and Intel x86 chips. The virtualization-aware processors supply a shortcut to the hypervisor when it needs to access a specific device, such as a network interface card or other peripheral. In software-only virtualization, each instruction to the hardware in service to an application passes through the hypervisor. With virtualization-aware hardware, that step is sometimes bypassed to allow the instruction to go straight to a hardware component.
The bug in Xen was hidden in the code meant to work with those hardware assists for the Intel interrupt controller, a component on a Xeon or other x86 chip that can grant access to other parts of a server. Beulich discovered that a malicious coder, aware of the bug, could take advantage of it to look at the contents of the memory on the host server. The controller was meant to be restricted to a tiny portion of the memory concerned with its own operation, but an exploit could order it to report data being used by other virtual machines or even the host's hypervisor itself. The hypervisor's operations themselves were not breached, but visibility into them is a serious security flaw.
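The general class of flaw described here, a read that is supposed to stay inside a small region but is not checked against its bounds, can be shown with a toy model. This is not Xen's code and does not reproduce the actual defect; the memory layout, the one-page region size, and the function names are assumptions made purely for illustration.

```python
PAGE = 4096
HOST_MEMORY = bytearray(PAGE * 4)   # toy "host memory": four pages
CONTROLLER_BASE = 0                 # the controller's own region starts here
CONTROLLER_SIZE = PAGE              # assumed size: one page

# Pretend another virtual machine keeps data in the page right after it.
HOST_MEMORY[PAGE:PAGE + 6] = b"secret"


def read_unchecked(offset: int, length: int) -> bytes:
    """Flawed read: trusts the caller's offset and can leave the region."""
    start = CONTROLLER_BASE + offset
    return bytes(HOST_MEMORY[start:start + length])


def read_checked(offset: int, length: int) -> bytes:
    """Safer read: rejects any request that leaves the controller's region."""
    if offset < 0 or length < 0 or offset + length > CONTROLLER_SIZE:
        raise ValueError("read outside the controller's memory region")
    start = CONTROLLER_BASE + offset
    return bytes(HOST_MEMORY[start:start + length])


leaked = read_unchecked(PAGE, 6)    # reaches into the neighboring page
```

In this model the unchecked read returns the neighbor's bytes, while the checked version raises an error for the same request. Nothing is modified in either case, which matches the article's point that the exposure was one of visibility rather than control.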
The bug only affected the 4.1 and later versions of Xen, which have been out since March 25, 2011, and would be in place in most public cloud services relying on Xen. Amazon Web Services uses its own Amazon Machine Images, but they are a close variant of Xen open source code and were affected as well.
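An administrator could apply the "4.1 and later" rule to a version string with a short check like the one below. This is a rough sketch: the function name `in_affected_range` is hypothetical, and the check says nothing about whether a given build has already been patched.

```python
def in_affected_range(version: str) -> bool:
    """Return True if a Xen version string is 4.1 or later.

    Only the major and minor numbers are compared; a missing minor
    number is treated as 0.
    """
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (4, 1)


# Illustrative version strings, not an inventory of real deployments.
results = {v: in_affected_range(v) for v in ["3.4.2", "4.0.4", "4.1.0", "4.4"]}
```

Comparing numeric tuples rather than raw strings avoids the common mistake of ranking "4.10" below "4.2".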
In other words, "an 'evil' virtual machine could essentially read over the shoulder of other virtual machines running on the same server, bypassing security," wrote Sean Gallagher on Ars Technica.
Amazon's Barr said the bug had affected "less than 10% of its servers."
Rackspace's Rhodes said, "We don't want to advertise the vulnerability before it's fixed -- lest we, in effect, ring a dinner bell for the world's cyber criminals." Rackspace was so cautious it didn't mention that the problem was buried in the Xen hypervisor. Amazon didn't either in its first mention of the problem, but cited a Xen exposure in its second, as work got underway at 7 p.m. Pacific time on Sept. 25.
Rhodes said: "That's the dilemma that we faced over the Xen bug. Such vulnerabilities are regularly found in software, whether proprietary or open source. The key, once a bug is identified, is to fix it swiftly and quietly. This particular vulnerability could have allowed bad actors who followed a certain series of memory commands to read snippets of data belonging to other customers, or to crash the host server."
SoftLayer said it took the time it needed to make sure the problem was fixed: "As soon as the risk was identified, our systems engineers and technology partners have been working nonstop to prepare the update," it said.
The Xen bug is both a good example of collective security and a warning of what can happen as IT shifts toward a greater reliance on cloud computing. The bug was discovered and interested parties notified before the full nature of the exploit was disclosed. Collective security action followed, apparently (at this early date) in time before any malicious code writers could act on the disclosure.
At the same time, the bug illustrates the cloud's dependence on one hypervisor or another and how a major hypervisor bug will affect more than one supplier. The growing, more uniform nature of x86 cloud environments represents a fatter target for highly skilled intruders to aim for, and a richer environment for manipulation if they succeed at getting inside.