VMware's ESXi Has Backup Bug - InformationWeek


09:45 AM

VMware's ESXi Has Backup Bug

The version of VMware's hypervisor that's embedded in shipping servers has a bug that under certain circumstances corrupts backup virtual machines.


There's a bug buried in VMware's ESXi hypervisor, the one that's embedded on x86 servers that are shipped to be used as ESX Server virtualization hosts in enterprise data centers. If it is inadvertently triggered during a backup, the recovery copy of a virtual machine won't work.

The bug appears only when a VMware Virtual Machine Disk (VMDK) is expanded beyond one of several size thresholds, starting at 128 GB. A VMDK will manifest the problem if it is expanded past that level or past 256 GB.

A backup system using Changed Block Tracking (CBT) makes periodic updates to the backup, capturing only the blocks of data that have changed. The incremental captures are quick and less demanding on the network than a full backup. If an expansion of the virtual disk crosses the 128-GB threshold, however, the QueryChangedDiskAreas command will return an erroneous list of allocated sectors. There's no known way to locate or retrieve the data on the missing sectors. The same is true at the 256-GB, 512-GB, 1,024-GB, and 2,048-GB thresholds; apparently each threshold is double the previous one.
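To make the threshold arithmetic concrete, here is a minimal Python sketch that lists which thresholds a planned disk expansion would cross. It is an illustration only, not VMware's logic: the function name `thresholds_crossed` is hypothetical, the rule that a disk sitting exactly on a threshold "crosses" it only when it grows past it is an assumption, and so is the idea that the doubling pattern continues beyond the 2,048-GB level named above.

```python
def thresholds_crossed(old_gb: float, new_gb: float, first_gb: int = 128) -> list:
    """Return the thresholds (in GB) that an expansion from old_gb to new_gb crosses.

    Assumptions (not confirmed by the article or by VMware):
    - thresholds start at first_gb and double indefinitely;
    - a threshold t is crossed when old_gb <= t < new_gb.
    """
    if new_gb < old_gb:
        raise ValueError("this sketch only models disk expansion")

    crossed = []
    threshold = first_gb
    while threshold < new_gb:
        if threshold >= old_gb:
            crossed.append(threshold)
        threshold *= 2  # each threshold is double the previous one
    return crossed


if __name__ == "__main__":
    # A 100-GB disk grown to 200 GB passes the 128-GB threshold.
    print(thresholds_crossed(100, 200))
    # A 130-GB disk grown to 250 GB stays between thresholds.
    print(thresholds_crossed(130, 250))
```

An empty list suggests the expansion stays between thresholds; a non-empty one flags an expansion that, by the article's description, could leave incremental backups with an incorrect sector list.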

These are major changes in the size of the virtual hard disk, and many customers are happy to set up VMDKs without implementing such major changes. But some of VMware's largest customers and VMware-compatible cloud suppliers, which are managing extra-large virtual machines on a large scale, do implement such changes.

"We are aware of the issue," a VMware spokesman said. "We have provided customer guidance via a Knowledge Base article (2090639). The article was published on Oct. 7 and has been updated -- most recently Nov. 4 -- with more details."


Spokesmen for third-party backup software supplier Veeam Software said VMware's Oct. 7 version of the article was "a little vague," and Veeam backup experts suggested clarifications in the guidance, which appeared in later versions, according to Doug Hazelman, VP of product strategy at Veeam.

"We will update the article as we have additional information, guidance, and a resolution to share with our customers," added the VMware spokesman for the Storage and Availability business in an email response to InformationWeek Tuesday.

However, users of Changed Block Tracking with ESXi who don't expand the virtual hard disk won't be bothered by the problem, and that describes many users of VMware virtual storage. Those who expand the virtual disk without crossing one of the designated thresholds will also remain free of any corrupted recovery copies. That's one reason the bug has remained hidden so long.

"It's essentially a problem that's always been there," said Veeam's Hazelman. Veeam's products use the Changed Block Tracking provided through VMware and invoke the QueryChangedDiskAreas command.

If users do cross a threshold, however, they may one day need to replace a failed primary system with its backup copy, stored as a corrupted VMDK. When the command is given to start the backup system, "the VM won't start," warned Hazelman. Indeed, that's one of the problems with the CBT bug. Corrupted backup copies may exist without their owners realizing it, until they need them.

Some Veeam customers read the Oct. 7 Knowledge Base article posted by VMware and started asking questions on Veeam's customer forums on Oct. 26. Mainly, they asked whether they were affected. The problem appears to be in all releases of ESXi 4 and 5. ESXi 5.5 Update 2 came out Oct. 13 and is believed to include the bug.

VMware has been trying to solve the problem at least since early October, but it's clearly proving hard to resolve. In the meantime, it has provided a workaround, so customers may expand their virtual disks if they choose and go over one of the thresholds without losing any backup data.

A customer planning to cross one of the virtual disk thresholds needs to "disable CBT, with their next backup constituting a full backup that resets the CBT table," said Hazelman. With a new CBT table that starts out above the threshold, the CBT function can be turned back on and used until the virtual disk approaches the next threshold level.

Veeam has an additional way to cope with the problem. Its Backup and Replication, Enterprise Edition, product has a SureBackup and Recovery Verification function. It activates the backup copy to make sure it's a functioning virtual machine, then shuts it down again.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

12/9/2015 | 5:45:06 AM
VMware CBT Reset PowerShell Cmdlet
There is already a patch for it, but after applying the patch you need to reset CBT on all VMs, which is an arduous task.

There is a PowerShell script/cmdlet that automates this operation across multiple VMs. How to use here:

