When I was planning the infrastructure for my revitalized lab, I intended to have VMware play a central role in my network and application testing. While that objective was eventually met, it didn't turn out like I had planned, and the path was extremely circuitous, involving multiple changes in strategic direction. Now with the recent release of VMware ESX 3.0 and VMware Server 1.0 (a replacement for the GSX line), I'm having to revisit the decisions all over again.
I need to be able to run a wide variety of systems so that I can test network applications and protocols across a large number of platforms. VMware provides a great platform for that kind of work. For instance, I might need to look at the way that different operating systems implement a specific technology within their networking stacks, while another project might require me to compare the security protocols that are offered by a handful of e-mail servers. This kind of work is a natural fit for VMware, since it lets me avoid having to buy and manage a dozen different PCs that are all running different operating systems.
Originally my plan was simply to use the top-of-the-line VMware ESX 2.5 platform for this. Even though it's a little bit overkill for my expected usage, its performance and management features make it clearly superior to the mid-tier GSX platform, and also promise to translate into better overall testing.
In particular, whereas GSX runs as a system service under Windows or Linux, and thus relies on the host operating system for basic file I/O and memory management, ESX employs an independent microkernel that manages these resources directly, giving it much better performance (ESX does use Linux for the "console" and some rudimentary system services, but the important work is handled by ESX directly). Another benefit of ESX is that it exposes system-management data through a local /proc interface and also through SNMP, while the only management interface for GSX is a relatively limited set of WMI performance counters, and even that is available only when Windows is the host operating system.
I knew that ESX had some peculiarities that would take some effort to overcome. For example, I knew that direct hardware management translated to stricter hardware requirements, and that using a relatively niche platform meant having less packaged software available. I also knew that ESX supported fewer guest operating systems than GSX. But I figured I could overcome these hurdles, and even if it took more up-front time to get things working, in the end it would take less time and energy than trying to manage multiple PCs. I was wrong on all these counts.
I was not at all prepared for just how strict the hardware requirements for ESX really were. Although the documentation lists a couple of dozen SCSI controllers as "supported," I figured this was just the usual narrow list of "endorsed" hardware, and that ESX was really a lot more flexible than that. But given the way that ESX manages resources directly, it actually does require specific hardware, and if a device isn't on the approved list then it probably won't be recognized.
For instance, I'd already bought a box of SATA II drives to use for this project, but none of the SATA II controllers were recognized by ESX. I was eventually able to use an Intel SRCS14L SATA I controller that happened to share a SCSI driver with a supported card, but a long-term solution would essentially require switching to real SCSI, which conflicted with my own strategy of using SATA II everywhere.
I ran into similar kinds of problems with some other devices too. For example, I had picked up a Crystalfontz CF635 LCD panel for displaying real-time performance data from ESX and the virtual machines, but the USB drivers in ESX couldn't find that device either. Similarly, there was no support in the Linux kernel for my motherboard's hardware sensors, meaning that I couldn't even monitor system health remotely like I do with all my other systems.
Normally you can just recompile some software in situations like these, but that wasn't easy with ESX 2.5, largely because VMware didn't provide all the necessary tools, and also because they strongly discouraged the practice. For most people, even getting user-space applications to run meant setting up another system with a similar version of Linux, installing whatever software you needed onto that system, and then copying the filesystem changes over (I had to do exactly that for my UPS software). But this strategy doesn't really work for low-level device drivers, which sometimes require changes to the kernel itself.
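The build-elsewhere-and-copy procedure boils down to something like the following sketch. Everything here is illustrative: the program name (`upsmon`), the paths, and the staging layout are stand-ins for whatever software you actually need, and the "install" step is faked so the example is self-contained; on a real donor box you'd run the software's own build and install into a staging root instead.

```shell
#!/bin/sh
# Sketch: build on a donor box running a similar Linux, install into a
# staging root, then unpack that tree onto the ESX console OS.
# All names and paths below are examples, not the real software.
set -e

STAGE=$(mktemp -d)    # staging root on the donor system

# On the donor box you would normally do something like:
#   ./configure --prefix=/usr/local && make && make DESTDIR="$STAGE" install
# Here we fake the install step so the sketch runs on its own:
mkdir -p "$STAGE/usr/local/bin"
echo '#!/bin/sh' > "$STAGE/usr/local/bin/upsmon"
chmod +x "$STAGE/usr/local/bin/upsmon"

# Package the staged tree; in practice you would scp this tarball to the
# ESX host and untar it as root relative to /.
tar -C "$STAGE" -czf /tmp/ups-tools.tar.gz .

# Stand-in for the ESX side: unpack into a scratch dir instead of /.
DEST=$(mktemp -d)
tar -C "$DEST" -xzf /tmp/ups-tools.tar.gz
ls -l "$DEST/usr/local/bin/upsmon"
```

The key point is that the copied binaries only work because the donor system's libraries match the console OS closely enough; this is exactly why the approach breaks down for kernel-level drivers.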
The straw that finally broke the camel's back, however, was the lack of support for some critical operating systems as guest VMs. I need to have a Solaris VM for basic interoperability tests, and it was next to impossible to get this working on ESX 2.5, because neither of the VMware-specific SCSI interfaces was supported by Solaris. After wrestling with this and the other problems for a couple of weeks, I decided that a change in strategic direction was in order.
Rounds 2 and 3
My first fallback position was to switch to VMware GSX, running on 64-bit SUSE Linux 9.3. At first this proved to be a good choice, since I was able to use most of the hardware and applications that I needed, and the broader range of supported guest operating systems also meant that I could pretty much run whatever I needed to. After the shine wore off, however, a new set of problems started to emerge.
In particular, the lack of any kind of remote management interface to VMware and the virtual machines proved to be extremely frustrating. There simply aren't any management interfaces to GSX under Linux at all. None. The best I could do was have a custom script parse the system process table for relevant data (memory in use, processor load, and so on), but even that fell far short.
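For what it's worth, that kind of process-table scraping amounts to something like the sketch below. It assumes that each running VM shows up as a `vmware-vmx` process (as GSX does on Linux) and that a procps-style `ps` is available; the output format is just my own made-up summary line.

```shell
#!/bin/sh
# Rough per-VM resource tally scraped from the process table.
# Assumption: each running GSX virtual machine appears as a vmware-vmx
# process, so summing their RSS and %CPU approximates total VM load.

vm_usage() {
    # Expects "comm rss pcpu" columns on stdin, e.g. from:
    #   ps -eo comm,rss,pcpu
    # The header line is skipped automatically since $1 won't match.
    awk '$1 == "vmware-vmx" { n++; rss += $2; cpu += $3 }
         END { printf "vms=%d rss_kb=%d cpu=%.1f\n", n, rss, cpu }'
}

# Live use:  ps -eo comm,rss,pcpu | vm_usage
```

This is obviously a poor substitute for a real management API: it sees only host-side resource consumption, and knows nothing about what's happening inside the guests.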
I had other kinds of problems too, either with GSX or with the overall system. For example, some of the command-line tools provided with GSX depend on a 32-bit Perl module, but that module wouldn't compile under 64-bit Linux (I had to compile the code on another platform and copy the 32-bit Perl subsystem to the VMware host). Worse, every time SUSE patched the Linux kernel, GSX forced me to rebuild everything, which made some of my cobbled solutions completely unfeasible as long-term strategies.
For all of these reasons (plus some others), I eventually switched to GSX under 32-bit Windows XP, which allowed me to resolve almost all of my outstanding issues. Even though GSX on Windows still does not provide any kind of SNMP interface to the virtual machines or the host process, it does provide some WMI performance data that I can tap into locally. Unfortunately, VMware's WMI performance monitor DLL appears to have one or more significant bugs that prevent the data from being read through other channels, but for the moment I am at least able to monitor per-VM resource utilization through perfmon and Crystalfontz CrystalControl2 LCD software (if you're curious to see what this looks like, you can watch some video).

Since GSX runs as a Windows service, I'm also able to use any hardware drivers that are needed. For example, I can use SuperMicro's own SNMP extensions, with a Cacti script and template to give me the needed visibility into the host platform's hardware. I'm also able to use any storage card I choose (for instance, using iSCSI to access my SATA II drives over the network), since the volumes are managed by the Windows host and not by GSX directly.
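The Cacti side of that setup is just a small data-input script that polls the host over SNMP and prints the name:value pairs Cacti expects. Here is a rough sketch; the host name, community string, field names, and especially the OIDs are placeholders (the real ones come from SuperMicro's MIB), and the snmpget calls are wrapped in a function so the fetch-and-format logic stays separate.

```shell
#!/bin/sh
# Hypothetical Cacti data-input script for host hardware sensors.
# The OIDs below are placeholders, NOT real SuperMicro OIDs; substitute
# the sensor OIDs from the vendor MIB.
HOST=${1:-vmhost}
COMMUNITY=${2:-public}

get() {
    # Fetch one value with net-snmp; -O qv prints just the bare value,
    # which is all Cacti needs.
    snmpget -v 2c -c "$COMMUNITY" -O qv "$HOST" "$1"
}

report() {
    # Cacti expects all fields on a single line as "name:value" pairs.
    echo "cpu_temp:$(get .1.3.6.1.4.1.99999.1.1) fan1:$(get .1.3.6.1.4.1.99999.1.2)"
}

# report    # uncomment to poll a live host
```

Cacti then graphs each named field over time, which is how I get a remote view of the host hardware even though GSX itself offers nothing comparable.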
Overall, I'm actually pretty pleased with where I've ended up. VMware GSX on Windows has proven to be robust and stable, and fast enough for what I do. I'm able to easily test most of what I need (there are some exceptions, such as hardware-specific applications), I've got a modicum of management visibility into the host and the virtual machines, and I've been able to substantially reduce some costs by consolidating a dozen systems into a single server (it's really nice to be able to suspend all the VMs and then put the host into standby power mode, essentially putting a dozen PCs to sleep all at once).
One surprise in all of this, however, is that I haven't really saved that much time over managing a collection of individual systems. Sure, I've saved money on capital investment and energy costs, but all of the virtual systems still have to be configured and managed to the same extent as real systems, so my operational expense has not changed very much (this is especially true considering the time I spent trying to get Solaris to work, which easily exceeded what a real PC would have required, and the real PC probably would have worked immediately).
Nothing stands still in this industry, however, and given that VMware is in the process of releasing new product lines, I'm having to reconsider my platform choices all over again.
For starters, there is the recent release of ESX 3.0. Even though most of the new features in ESX 3.0 are oriented towards datacenter installations, it also seems to address many of my (admittedly) pedestrian needs. In particular, the product now has official support for Solaris 10 and some of the other operating systems that I need to use. It also has initial support for iSCSI (albeit software-only at this point), which would allow me to use my SATA II drives, and it appears to have improved management tools, not to mention SNMP and local management visibility into the host and the virtual machines. On the other hand, ESX 3.0 still seems to have very strict hardware requirements (though at least this time I know to believe the specs), and installing custom software still looks to be difficult and discouraged.
But I can't stay with GSX for much longer either, given that the company is in the process of replacing the GSX line with the new VMware Server product, which just started shipping too. VMware Server also promises some of the same improvements found in ESX 3.0, especially in the areas of official support for guest operating systems, and better administrative tools. But it does not yet address my need for SNMP visibility (and the WMI bugs don't seem to have been fixed either), nor does it appear to have any kind of explicit support for iSCSI (the technology is very nearly disavowed by the support personnel).
All told, ESX and VMware Server are pretty much tied on my features scorecard. For now, I think the deciding factor has to be hardware. In particular, ESX 3.0 currently has official support only for its own software-based iSCSI initiator, and that's the only way I can get to my SATA II array. On the other hand, I can use any hardware-based iSCSI adapter I want with VMware Server, as long as there is a Windows driver and as long as I don't have to talk to VMware support about it (I'm about to experiment with some QLogic QLA4052C dual-gigabit adapters for just this purpose). However, VMware is planning to add hardware-based iSCSI initiator support to ESX in a subsequent point release, and at that point the scales will tip back in its favor.