There's a bug buried in VMware's ESXi hypervisor, the one that's embedded on x86 servers shipped for use as ESX Server virtualization hosts in enterprise data centers. If it is inadvertently triggered during a backup, the recovery copy of a virtual machine won't work.
The bug appears only when a VMware Virtual Machine Disk (VMDK) is expanded beyond one of several thresholds, starting at 128 GB. A VMDK will manifest the problem if it is expanded past that level, or past 256 GB.
A backup system using Changed Block Tracking (CBT) makes periodic updates to the backup copy, capturing only the blocks of data that have changed. The incremental captures are quick and less demanding on the network than a full backup. If an expansion of the virtual disk crosses the 128-GB threshold, however, the command QueryChangedDiskAreas will yield an erroneous list of allocated sectors. There's no known way to locate or retrieve the data on the missing sectors. The same is true at the 512-GB, 1,024-GB, and 2,048-GB thresholds; apparently each threshold is double the previous one.
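To make the doubling pattern concrete, here is a minimal Python sketch that lists which thresholds a planned expansion would cross. It is an illustration, not a VMware tool: the function name, the choice to work in whole gigabytes, and the boundary rule (a disk counts as crossing a threshold when it starts at or below it and ends above it) are all assumptions layered on the thresholds described above.

```python
def crossed_thresholds(old_gb, new_gb, first_gb=128):
    """Return the thresholds (128, 256, 512, ...) crossed by an expansion.

    Assumption: a threshold t is "crossed" when old_gb <= t < new_gb.
    The article does not spell out the exact boundary behavior.
    """
    crossed = []
    t = first_gb
    while t < new_gb:
        if old_gb <= t:
            crossed.append(t)
        t *= 2  # each threshold is double the previous one
    return crossed


# Expanding a 100-GB disk to 300 GB crosses two thresholds.
print(crossed_thresholds(100, 300))
# Expanding from 130 GB to 250 GB stays between thresholds.
print(crossed_thresholds(130, 250))
```

An empty list means the expansion stays between thresholds, which is the case in which the article says backups remain unaffected.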
These are major changes in the size of the virtual hard disk, and many customers are happy to set up VMDKs without implementing such major changes. But some of VMware's largest customers and VMware-compatible cloud suppliers, which are managing extra-large virtual machines on a large scale, do implement such changes.
"We are aware of the issue," a VMware spokesman said. "We have provided customer guidance via a Knowledge Base article (2090639). The article was published on Oct. 7 and has been updated -- most recently Nov. 4 -- with more details."
Spokesmen for third-party backup software supplier Veeam Software said VMware's Oct. 7 version of the article was "a little vague," and Veeam backup experts suggested clarifications in the guidance, which appeared in later versions, according to Doug Hazelman, VP of product strategy at Veeam.
"We will update the article as we have additional information, guidance, and a resolution to share with our customers," added the VMware spokesman for the Storage and Availability business in an email response to InformationWeek Tuesday.
However, users of Changed Block Tracking with ESXi who don't expand the virtual hard disk, as is the case with many users of VMware virtual storage, won't be bothered by the problem. If they expand the virtual disk without crossing one of the designated thresholds, they will also remain free of any corrupted recovery copies. That's one reason the bug has remained hidden so long.
"It's essentially a problem that's always been there," said Veeam's Hazelman. Veeam's products use the Changed Block Tracking provided through VMware and invoke the QueryChangedDiskAreas command.
If customers do cross a threshold, however, they may one day need to replace a failed primary system with its backup copy, stored as a corrupted VMDK. When the command is given to start the system, "the VM won't start," warned Hazelman. Indeed, that's one of the problems with the CBT bug. Corrupted backup copies may exist without their owners realizing it, until they need them.
Some Veeam customers read the Oct. 7 Knowledge Base article posted by VMware and started asking questions on Veeam's customer forums on Oct. 26. Mainly, they asked whether they were affected. The problem appears to be in all releases of ESXi 4 and 5. ESXi 5.5 Update 2 came out Oct. 13 and is believed to include the bug.
VMware has been trying to solve the problem at least since early October, but it's clearly proving hard to resolve. In the meantime, it has provided a workaround, so customers may expand their virtual disks if they choose and go over one of the thresholds without losing any backup data.
A customer planning to cross one of the virtual disk thresholds needs to "disable CBT, with their next backup constituting a full backup that resets the CBT table," said Hazelman. With a new CBT table that starts out above the threshold, the CBT function can be turned back on and used until the virtual disk approaches the next threshold level.
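As a rough planning aid for that workaround, the Python sketch below estimates how much room a disk has before its next expansion would cross a threshold. The function name and the whole-gigabyte arithmetic are hypothetical; only the 128-GB starting point and the doubling pattern come from the reporting above.

```python
def next_threshold_gb(size_gb, first_gb=128):
    """Return the smallest threshold (128, 256, 512, ...) at or above size_gb.

    Assumption: thresholds start at 128 GB and double indefinitely.
    """
    t = first_gb
    while t < size_gb:
        t *= 2
    return t


size = 300  # current virtual disk size in GB (example value)
limit = next_threshold_gb(size)
print(f"Next threshold: {limit} GB, headroom: {limit - size} GB")
```

Any expansion that would push the disk past the reported limit is the point at which Hazelman's advice applies: disable CBT, take a full backup, then turn CBT back on.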
Veeam has an additional way to cope with the problem. Its Backup and Replication, Enterprise Edition, product has a SureBackup and Recovery Verification function. It activates the backup copy to make sure it's a functioning virtual machine, then shuts it down again.
Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...