How To Survive A Windows 2003 Cluster Crash

A complete cluster crash is a server admin's worst nightmare. Our intrepid columnist describes his, how he dealt with it--and how you can avoid it.
Now, why have I noted that SQL Server can take as long as ten hours to recover? It should not be this way. As easy as it is to install SQL Server on a virgin cluster (easier than Exchange 2003), the opposite is the case if you have to rebuild the cluster and reinstall SQL Server on the same nodes (even after cluster cleanup). You cannot simply install SQL Server again. You first have to remove all remnants of SQL Server from the nodes before you can reinstall the service, and even after this SQL Server may not make a full recovery. Let’s first look at the initial options.

The cluster utilities you have to recover a brain-dead cluster are few and far between, and not one is the holy grail of cluster repair kits. The first tool you can use to detect any sign of life from the cluster is the Cluster Diagnostics Tool (clusdiag). This tool can help troubleshoot cluster problems. If the cluster is not starting because some dependent resource (such as storage) is down, clusdiag will report this. However, if the cluster database is damaged you will not get much help from clusdiag to restore it.

The cluster utility (cluster.exe) offers limited help recovering from failure. It is only useful to clean up a dead cluster database and restore the cluster database (clusdb) to its virgin state. Again, if the cluster database is corrupt you don’t have much choice but to restore it. A bad clusdb cannot be fixed with packing, reindexing or something similar. It’s not an Access database or something like a config file.

The cluster database and the content of associated files (like the quorum logs) are distributed. Each node gets an identical copy of the database so that it can operate in the cluster when it is called into failover duty. Thus, if the cluster database is dead on one node it is almost certainly dead on all the nodes. You might be in luck if a cluster node was taken out of the cluster (through shutdown rather than eviction) before the corruption, but the chance of that fortune befalling you is slim.

It is thus critical to keep daily or very recent backups of your cluster database, because a cluster that has failed due to corruption of the cluster database can only be recovered through a restore. However, you can’t simply back up the cluster database. Many backup tools that simply back up files do not back up the cluster database. Here’s why.

The Windows Server 2003 cluster database contains the cluster state data that is replicated among nodes of a server cluster. This ensures that all nodes have a consistent configuration. However, the clusdb database is actually the cluster hive of the registry. So because of the distributed nature of clusters, backing up a local cluster database contained in the clusdb hive is not sufficient to ensure that the full cluster state has been saved. The data is stored in a number of places in the quorum resource and the database and you need the proper backup tool to obtain reliable backup data.

There are four groups of data that are critical to the proper operation of Windows Server 2003 clusters. These groups are as follows:

- The cluster disk signatures and their partitions
- The cluster quorum data
- The actual data on the shared cluster disks
- The data on the individual cluster nodes

Before you can back up any data on the server cluster nodes, you need to make sure that you back up the cluster disk signatures and partitions. This is achieved using Automated System Recovery (ASR) in the Backup Wizard. Backing up this data is mandatory if you later need to restore the signature of the quorum disk. You will lose the quorum disk signature if, for example, you are caught up in a complete system implosion as described earlier. Inevitably, the signature of the quorum disk will have changed since you last backed up.

You should always activate ASR. It will be invaluable in helping you recover a dead cluster. ASR is a two-part recovery process consisting of ASR backup and ASR restore. These tools are accessed in the Backup or Restore Wizard in Advanced mode. We will return to backup and restore of clusdb in Part II of this article, where I will provide a detailed discussion of backing up and restoring the cluster data.

Now back to cluster triage. Unfortunately our case here has taken a turn for the worse. Doing diagnostics on the server nodes, we were quick to discover that the cluster data was corrupt and irrecoverable using any utility. We quickly turned to the backup restore option. Again, another death blow to our cluster. The technician backing up the clusters was not backing up the cluster data (system state) as required. He was simply backing up the server using our standard server backup system. We thus did not have reliable cluster data to recover our cluster.

We thus found ourselves stuck with option 3. The old cluster was given its last rites and steps were taken to salvage services to be donated to the new cluster. At this point it must be noted with extreme emphasis that while the cluster is dead, the resources and applications that relied on it will not be. In the case of Exchange 2003, the mailbox stores, the Exchange configuration data in the registry, and the Exchange databases and logs are all intact. Make sure you back up the active node and data disks that were intact at the time of the cluster failure.

If you have a failed cluster carrying Exchange 2003 or SQL Server 2000 or both, then take note of what I am about to say. DO NOT DO ANYTHING to your Exchange binaries, databases, logs and so on. In short, leave Exchange on the failed cluster nodes completely alone and do nothing but back up the node and the Exchange data on the shared cluster disks.

Now SQL Server is a little different. Your old SQL Server 2000 system cannot be resurrected. All, however, is not lost with SQL Server. You still have access to all your SQL Server databases. So make a copy NOW of all databases that were attached at the time of the failure. You are going to need them for the restore of SQL Server. These databases include all system databases such as Master and TempDB.

Also make copies of your databases’ transaction logs. In short, back up the entire data directory of your SQL Server installation on the shared disk resource. Once you have completed the backup, rename that data directory so that nothing can overwrite the data in it. You should rename the entire SQL Server installation folder on the shared disk. (Typically it will be something like S:\Program Files\\SQL Server\Data. Change it to something like S:\Program Files\Old SQL Server \Data.)
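The copy-then-rename step above can be scripted so that the rename only happens once the copy has been verified. What follows is a minimal sketch in Python, not the procedure we actually ran that night. The function names and the three directory arguments are hypothetical, and it assumes the SQL Server services are stopped so that no database or log file is locked.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_then_rename(data_dir: Path, backup_dir: Path, renamed_dir: Path) -> List[str]:
    """Copy data_dir to backup_dir, verify every file, then rename data_dir.

    Returns the relative paths of the files that were copied and verified.
    Raises before renaming if any copied file differs from its source.
    """
    # Never write over an earlier backup or an already renamed folder.
    if backup_dir.exists() or renamed_dir.exists():
        raise FileExistsError("refusing to overwrite an existing backup or renamed folder")

    # Copy the whole tree: data files, transaction logs, system databases.
    shutil.copytree(data_dir, backup_dir)

    # Compare each source file against its copy before touching the original.
    verified = []
    for source in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(data_dir)
        if file_digest(source) != file_digest(backup_dir / relative):
            raise IOError(f"copy of {relative} does not match the original")
        verified.append(relative.as_posix())

    # Only now move the original aside so a reinstall cannot overwrite it.
    data_dir.rename(renamed_dir)
    return verified
```

You would call it with the data directory on the shared disk, a backup location on a different disk, and the new name for the original folder. The function deliberately refuses to run if either target already exists, since overwriting an earlier copy is precisely the accident this step is meant to prevent.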

You are now ready to reinstall your cluster and reattach Exchange 2003 and SQL Server 2000 to it. This process will be discussed in Part II. The night will be long and arduous. In the end you will recover your cluster, Exchange and SQL Server. In my case we had three thousand users that were going to connect to Exchange in less than 10 hours and the SQL Server databases were going to be needed to service more than two thousand customers. The smell of the ocean and Alaskan crab bait was unmistakable in my mind as I considered my fate should we fail.
