Amazon Reboots Cloud Servers, Xen Bug Blamed - InformationWeek
9/26/2014
09:55 AM

Amazon Reboots Cloud Servers, Xen Bug Blamed

Amazon tells customers it has to patch and reboot 10% of its EC2 cloud servers before Oct. 1.

Amazon Web Services is working hard to patch its EC2 cloud host servers for what appears to be a severe hypervisor security defect. The process started at 7:00 p.m. Pacific time Thursday. It needs to be completed in the next five days -- before Oct. 1.

During that time, Amazon must shut down, patch, and reboot 10% of EC2's servers. Amazon Web Services spokesman Jeff Barr confirmed in a blog post Thursday that the defect is in the Xen hypervisor.

Amazon specifically discounted any suggestion that the update and reboot had anything to do with the Bash bug, discovered Tuesday, which poses another threat to many systems around the Internet.

Amazon chose Xen in 2006 to run its Amazon Machine Image virtual machines. The supplier of Xen, the XenSource unit of Citrix, has posted a notice on Citrix's security updates page that it will disclose both a major vulnerability, known as XSA-108, and its patch on Oct. 1. No details on the vulnerability have been published. The delay gives major Xen users, such as Amazon, time to apply the patch before the flaw becomes public. On Sept. 24, Amazon began notifying customers that the patching and rebooting would occur over the next few days.

"These updates must be completed by Oct. 1 before the issue is made public as part of an upcoming Xen Security Announcement (XSA)," Barr wrote.

"As we explained in emails to the small percentage of our customers who are affected… the instances that need the update require a system restart of the underlying hardware and will be unavailable for a few minutes while the patches are being applied and the host is being rebooted," Barr explained.

Most operating system, hypervisor, and application updates can be applied without taking the whole server down. But "certain limited types of updates require a restart," continued Barr. Amazon will stagger the updates so that no two regions or availability zones within a region are down at the same time. That means that followers of Amazon's best practices who have a backup system ready to go in a second availability zone are likely to experience no outage. But not all customers use the two-pronged guarantee of availability. Those running a production system in one zone are likely to experience a brief downtime.
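Whether a given deployment follows that two-zone practice can be checked mechanically from an inventory. The sketch below is a minimal illustration, not Amazon's tooling: the record fields (`id`, `service`, `zone`) and the sample fleet are hypothetical, and a real inventory would come from whatever asset list the operations team maintains.

```python
from collections import defaultdict


def single_zone_services(instances):
    """Return the names of services whose instances all sit in one availability zone."""
    zones_by_service = defaultdict(set)
    for inst in instances:
        zones_by_service[inst["service"]].add(inst["zone"])
    # A service with exactly one zone has no standby elsewhere during a host reboot.
    return sorted(name for name, zones in zones_by_service.items() if len(zones) == 1)


# Hypothetical inventory: field names and values are assumptions for illustration.
fleet = [
    {"id": "i-web-1", "service": "web", "zone": "us-east-1a"},
    {"id": "i-web-2", "service": "web", "zone": "us-east-1b"},
    {"id": "i-db-1", "service": "billing-db", "zone": "us-east-1a"},
]

print(single_zone_services(fleet))  # ['billing-db']
```

Here the web tier spans two zones and should ride out a staggered reboot, while the single-zone database is the kind of system likely to see the brief downtime.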

Each zone's hosts "will restart with all saved data and all automated configuration intact. Most customers should experience no significant issues with the reboots," Barr wrote.

"We understand that for a small subset of customers the reboot will be more inconvenient; we wouldn't inconvenience our customers if it wasn't important and time-critical to apply this update."

Exactly how long that "inconvenience" might be may depend on the individual customer's situation.

"This update looks extremely critical," said Sebastian Stadil, CEO of Scalr, a supplier of a front-end cloud management platform for Amazon and other service suppliers. Judging by the urgency with which Amazon is going about its updates, the flaw looks like a major hole in the Xen hypervisor that has few barriers to its use and is easy to exploit, he said.

If the flaw is easy to exploit, then once it's disclosed the "script kiddies will write scripts that cause lots of destruction just for fun," Stadil said in an interview.

Giving major users of the technology time to update before the vulnerability is made known is now a standard practice among researchers and security vendors, who often find the vulnerabilities. "It's considered a best practice," he said.

Stadil expressed puzzlement that Rackspace, another large user of Xen, hasn't revealed a similar update plan. Rackspace uses a version of Xen in its public cloud services; it relies on KVM in its managed cloud and private cloud operations. Rackspace couldn't be reached for comment late Thursday.

Another cloud management platform, RightScale, also gave notice to its users that the Amazon update was underway. CTO Thorsten Von Eicken wrote yesterday in a blog post, his second on the subject: "The fact that we're simultaneously 'blessed' with both this AWS event as well as the (likely unrelated) bash exploit adds to the overall emergency workload and stress" of customers.

But there are steps customers can take to try to ensure minimal impact on their cloud systems, he wrote. Round-the-clock services on Amazon with their own operations teams that have designed for fault tolerance "will pre-emptively take care of relaunching affected instances and will not have an issue."

They may be in the minority, however. Large-scale Amazon customers running thousands of virtual machines "will face a stiffer challenge because even manual triage of which instances are affected and which are not is impractical. The ops team may be scrambling to write scripts to automate some of these tasks."

Software-as-a-service vendors on Amazon with single-tenant instances of their applications "may have hundreds of small deployments, each of which may need some intervention," Von Eicken predicted. Amazon presents reports on consolidated accounts. Operations teams may have to sort through the accounts, looking for affected systems.
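The kind of triage script Von Eicken describes might look something like the following. It is a hedged sketch under stated assumptions: the inventory and event records, their field names (`id`, `account`, `instance_id`, `not_before`), and the sample dates are all hypothetical stand-ins for whatever a team can export from its accounts and reboot notices. It groups affected instances by account with the earliest reboot window first, and flags notices that name an instance missing from the inventory.

```python
from collections import defaultdict
from datetime import datetime


def triage_reboots(inventory, events):
    """Return (affected instances grouped by account, instance IDs not in inventory)."""
    # Keep the earliest scheduled window for each instance named in a notice.
    earliest = {}
    for event in events:
        iid, start = event["instance_id"], event["not_before"]
        if iid not in earliest or start < earliest[iid]:
            earliest[iid] = start

    by_account = defaultdict(list)
    known_ids = set()
    for inst in inventory:
        known_ids.add(inst["id"])
        if inst["id"] in earliest:
            by_account[inst["account"]].append((earliest[inst["id"]], inst["id"]))

    # Sort each account's list so the soonest reboot comes first.
    grouped = {account: sorted(items) for account, items in by_account.items()}
    unknown = sorted(iid for iid in earliest if iid not in known_ids)
    return grouped, unknown


# Hypothetical data: accounts, IDs, and times are illustrative assumptions.
inventory = [
    {"id": "i-aaa", "account": "prod"},
    {"id": "i-bbb", "account": "prod"},
    {"id": "i-ccc", "account": "staging"},
]
events = [
    {"instance_id": "i-bbb", "not_before": datetime(2014, 9, 28, 3, 0)},
    {"instance_id": "i-ccc", "not_before": datetime(2014, 9, 27, 22, 0)},
    {"instance_id": "i-zzz", "not_before": datetime(2014, 9, 29, 1, 0)},
]

grouped, unknown = triage_reboots(inventory, events)
for account, items in sorted(grouped.items()):
    for start, iid in items:
        print(account, iid, start.isoformat())
print("not in inventory:", unknown)
```

Grouping by account mirrors the sorting task described above, and the list of unknown IDs gives an operations team a starting point for instances it may have lost track of.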

Relaunching workloads will cost smaller services time as they manually identify affected instances and relaunch them in advance of the downtime. "Some failures" are likely to occur and be visible to end users, he noted.

Those whose operations personnel are on vacation or non-existent will have to count on their servers to reboot on their own, as Amazon predicts, after a few minutes of downtime. That's possible, Von Eicken continued, but "the reality is such that I would expect a high failure rate anyway."

He also warned about "forgotten servers," buried in organizations with multiple accounts, some of which they've lost track of. The AWS email notifications in some cases are going to unread or deleted email inboxes. "I expect that almost every organization ends up with a few 'oops, forgot we had'… and some of these will turn out to cause user visible failures."

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
Charlie Babcock,
User Rank: Author
9/26/2014 | 4:39:40 PM
Not in the wild yet, we hope
We have to await the details of Citrix's Oct. 1 announcement. Let's hope it's not in the wild yet. But once news of the exploit and the details of how it works are published, it will go into the wild, unless servers are protected against it.
RightScale1,
User Rank: Apprentice
9/26/2014 | 4:36:01 PM
AWS Reboot FAQs
AWS Reboot FAQs are available on the RightScale Blog
Laurianne,
User Rank: Author
9/26/2014 | 3:27:32 PM
Re: If hypervisor compromised, so potentially are 'private cloud' tenants
So we are talking about a hypervisor defect in the wild. That is notable given this has been a theoretical "what if" discussion until now among the virtualization community.
Charlie Babcock,
User Rank: Author
9/26/2014 | 12:16:01 PM
If hypervisor compromised, so potentially are 'private cloud' tenants
Amazon offers private cloud in a multi-tenant setting. The VPN and other connections are private, but on the same multi-tenant server sits a customer with an unsecured desktop. That's OK, provided there's no breach in the barriers imposed by the hypervisor, points out Scalr's Sebastian Stadil. If there is a breach, then Amazon's concept of private cloud as well as customers' "private" production systems are at risk. That's probably part of the reason Amazon is acting with such urgency in this case.