Windows Server 2008 Getting Fault Tolerance Add-On

The availability of everRun as a feature of the operating system could allow Windows Server 2008 to host multiple virtual machines running mission-critical systems.
Microsoft will at least gain a competitive talking point to use with customers considering VMware as an alternative. "This work represents the value of Microsoft's partner ecosystem..." Shultz said.

VMware is working on high-availability features for ESX Server as well.

In addition to full fault tolerance, everRun allows data center administrators to select lower levels of availability. Level 1 uses clustering failover: if a single server component fails, the work is moved to another physical server.

In Marathon's lexicon, Level 2 availability means a system recovers all data and application processing in about 20 seconds by handling network and storage component failures. Network and storage failures account for about 80% of system failures, Phillips said. Level 2-style protection can be purchased now for Citrix XenServer and Windows Server 2003 operations, and will be available in the second quarter for Windows Server 2008 and XenServer. At some point in the future, it will be available for a Windows Server 2008 host running any number of Hyper-V virtual machines.

Level 3 system fault tolerance recovers within milliseconds after any software or hardware component fails, a disruption unlikely to be noticed by the end user. EverRun can guarantee fast recovery by maintaining a live virtual machine on an identical physical server. If one system fails, the other immediately takes over. Level 3 fault tolerance will be available for Windows Server 2008 and the Citrix XenServer hypervisor in the second quarter, with a Hyper-V version to follow at an unspecified point in the future.
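For readers who want the three levels side by side, the descriptions above can be summarized as a small lookup table. This is only an illustrative sketch in Python: the `AvailabilityLevel` class, the `LEVELS` list, and the `describe` function are hypothetical names invented here, not part of everRun or any Marathon product, and the values simply restate what the article reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityLevel:
    """One availability level as described in the article (hypothetical type)."""
    level: int
    mechanism: str
    covers: str
    recovery: str  # recovery time as reported, or "not stated"


# Values restate the article; nothing here comes from product documentation.
LEVELS = [
    AvailabilityLevel(
        1,
        "clustering failover to another physical server",
        "a single server component failure",
        "not stated",
    ),
    AvailabilityLevel(
        2,
        "recovery of all data and application processing",
        "network and storage component failures",
        "about 20 seconds",
    ),
    AvailabilityLevel(
        3,
        "live virtual machine kept on an identical physical server",
        "any software or hardware component failure",
        "milliseconds",
    ),
]


def describe(level: int) -> str:
    """Return a one-line summary of the requested level."""
    for entry in LEVELS:
        if entry.level == level:
            return (
                f"Level {entry.level}: {entry.mechanism}; "
                f"covers {entry.covers}; recovery: {entry.recovery}"
            )
    raise ValueError(f"unknown availability level: {level}")


if __name__ == "__main__":
    for number in (1, 2, 3):
        print(describe(number))
```

The recovery time for Level 1 is left as "not stated" because the article gives no figure for it.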

It's relatively easy to recover a failed virtual machine by itself, and several vendors offer ways to do it. One way is simply to go back to where the virtual machine's image was stored and activate another instance. But such a move does nothing to recover application processing at the point of failure or any lost data. It's more complicated to recover both the virtual machine's application processing and its data at the instant of failure.
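The difference between the two kinds of recovery can be shown with a toy model. The Python sketch below is purely illustrative and makes a loud assumption: it models the live standby as a second object that applies the same requests as the primary, which is a simplification chosen for the example and not a description of how everRun or any other product actually keeps two machines in step. `ToyVM` and `simulate` are hypothetical names.

```python
class ToyVM:
    """A stand-in for a virtual machine whose only state is a request count."""

    def __init__(self, processed: int = 0) -> None:
        self.processed = processed

    def handle_request(self) -> None:
        self.processed += 1


def simulate(requests_before_failure: int, image_state: int = 0) -> tuple[int, int]:
    """Compare restart-from-image with takeover by a live standby.

    Returns (state after restarting from the stored image,
             state held by the standby at the moment of failure).
    """
    primary = ToyVM(image_state)
    standby = ToyVM(image_state)  # assumption: standby mirrors every request

    for _ in range(requests_before_failure):
        primary.handle_request()
        standby.handle_request()

    # The primary fails here. Restarting from the stored image brings back
    # a working VM, but only with the state captured in that image.
    restarted = ToyVM(image_state)
    return restarted.processed, standby.processed


if __name__ == "__main__":
    from_image, from_standby = simulate(5)
    print("restart from image:", from_image)
    print("live standby:", from_standby)
```

In this toy run the restarted instance is healthy but has none of the work done since the image was stored, while the standby holds the state as of the failure, which is the gap the paragraph above describes.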