High availability (HA) means building redundancy into critical business systems so the failure of one component, such as a bad NIC in a server, won't cause the application to fail. It's similar to building multiple highways into a city: If problems arise on one road, traffic can take an alternate route. Standard HA practices include deploying redundant hardware and using multiple network links.
The decision to build highly available systems, however, should always be a business decision. HA is expensive, and working with the business units lets you focus on the services that are truly critical to the company. With a clear mandate in hand, you can formulate a cost-effective strategy. Then the pressure is on IT to pick the right technologies to get the job done.
Before you start building multiple roads, you have to know the amount of traffic to accommodate. In other words, don't build a six-lane highway if you only need two. Every application consumes memory, processing, network capacity, and storage in a unique way. As a result, the first part of the process is to assess the resources your core applications need to run properly. Your application and workload analysis will naturally unearth the degree of availability you need to get the job done, and your design will flow from there.
That's the technical side. You also need to work with the business side to calculate how much the downtime of critical applications costs the company, which in turn will dictate how much you can spend to keep systems up and running. For instance, do seconds or milliseconds matter on a transaction-by-transaction basis? Such calculations will determine whether you need to build a two-, four-, or six-lane highway.
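As a rough illustration of that calculation, the following Python sketch converts an availability target into expected hours of downtime per year and multiplies it by an hourly cost. The $10,000-per-hour figure and the function names are hypothetical; the real number has to come from the business units.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, given an hourly cost agreed with the business."""
    return annual_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    # Hypothetical figure: the business estimates $10,000 per hour of email downtime.
    cost_per_hour = 10_000.0
    for availability in (0.99, 0.999, 0.9999):
        hours = annual_downtime_hours(availability)
        cost = annual_downtime_cost(availability, cost_per_hour)
        print(f"{availability:.2%} available: {hours:6.2f} h/yr down, about ${cost:,.0f}/yr")
```

The difference in yearly cost between two availability levels gives a ceiling on what it is worth spending to move from one to the other.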
Let's use Microsoft Exchange as an example. Business has very little patience for email being down, so spending the dollars to bring more availability to your messaging environment is the answer. To assess your application requirements, look at factors like CPU utilization and the kinds of messaging load you support; for instance, your users may send high volumes of email with large files attached. These data points will influence your server and storage designs.
Let's say you run Exchange 2007 and support 500 users with medium to heavy workloads. Your back-end mail server is averaging 10% CPU utilization on an eight-core server. You also have a separate set of virtualized servers that average 35% CPU utilization. In this scenario, you can safely add a virtualized instance of Exchange to act as a redundant cluster node without affecting existing services.
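The headroom check behind that conclusion can be sketched in a few lines of Python. The utilization figures come from the scenario above; the eight-core virtual host and the 70% ceiling are assumptions made only for illustration, and a real assessment would also look at memory, storage I/O, and peak rather than average load.

```python
def projected_utilization(current_pct: float, host_cores: int, added_core_demand: float) -> float:
    """CPU utilization (percent) of a host after adding a workload that needs
    added_core_demand cores' worth of processing."""
    used_cores = host_cores * current_pct / 100.0
    return (used_cores + added_core_demand) / host_cores * 100.0


# From the scenario: the mail server averages 10% of an eight-core box,
# which is roughly 0.8 cores of demand.
exchange_demand_cores = 8 * 10 / 100.0

# Assumptions (not in the scenario): the virtual host also has eight cores,
# and we do not want average utilization to exceed 70% after a failover.
VIRTUAL_HOST_CORES = 8
CEILING_PCT = 70.0

projected = projected_utilization(35.0, VIRTUAL_HOST_CORES, exchange_demand_cores)
fits = projected <= CEILING_PCT
print(f"Projected utilization: {projected:.1f}% (fits under ceiling: {fits})")
```

Under these assumptions the virtual host lands at roughly 45% even when it carries the full Exchange load, which is why adding the redundant node is safe in this example.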
Once your analysis is complete, you can begin laying the foundations for high availability in your network, storage, and application infrastructures.
No matter what your application, whether it's messaging, a point-of-sale system, or a complex business analytics package, adding network resiliency makes sense because the network is where you're likely to experience failures.
Step one to building network resiliency is to take advantage of capabilities already built into your gear. Assuming you're running enterprise-class routers and switches, you already have access to some of the features needed to do the job. For instance, many routers support protocols that provide basic redundancy. Key protocols to use include Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP), both of which allow a secondary router to take over if the primary fails, ensuring that traffic continues to flow.
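To make the takeover idea concrete, here is a simplified Python model of priority-based election in the spirit of Hot Standby Router Protocol and Virtual Router Redundancy Protocol: among the routers still alive, the highest priority wins, with the higher IP address breaking ties. It is a sketch only. The router names, addresses, and priorities are made up, and the real protocols add timers, advertisements, and preemption rules that are not modeled here.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class Router:
    name: str
    address: IPv4Address
    priority: int       # higher priority is preferred
    alive: bool = True


def elect_active(routers: Iterable[Router]) -> Optional[Router]:
    """Pick the router that should own the shared gateway address."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    # Highest priority wins; the higher IP address breaks ties.
    return max(candidates, key=lambda r: (r.priority, r.address))


# Hypothetical pair of gateway routers.
primary = Router("rtr-a", IPv4Address("10.0.0.2"), priority=110)
secondary = Router("rtr-b", IPv4Address("10.0.0.3"), priority=100)

before = elect_active([primary, secondary])
after = elect_active([replace(primary, alive=False), secondary])
print(f"Normal operation: {before.name}; after primary failure: {after.name}")
```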
The next step is to extend that redundancy throughout the different tiers of your network. For example, do you have critical applications that need to be accessible from outside your network? If so, you may need to add redundant firewalls and routers along your path to the Internet. For customer-facing business applications, consider adding a second Internet link from your provider. And since you're accounting for all variables, look into using more than one carrier.
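The payoff from a second link can be estimated with basic availability arithmetic: components in series multiply their availabilities, while redundant paths fail only if every path fails. The Python sketch below assumes failures are independent, and the 99% figures are hypothetical. Two links from the same provider often share failure modes, which is one reason a second carrier is worth considering.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant paths where any one path is enough.
    Assumes the paths fail independently of each other."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures: each Internet link is 99% available on its own.
single_link = 0.99
dual_links = parallel_availability([0.99, 0.99])

# A single firewall (assumed 99.9%) in front of both links still caps the path.
path = serial_availability([0.999, dual_links])
print(f"one link: {single_link:.4%}, two links: {dual_links:.4%}, with one firewall: {path:.4%}")
```

The last line shows why redundancy has to extend through every tier: a single non-redundant device in the path limits what the redundant links behind it can deliver.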
Don't forget the critical servers connected to your network. The shift to highly virtualized environments has prompted all the major server vendors to add multiple-gigabit Ethernet NICs to their system boards. This means you can take advantage of NIC teaming, in which multiple interface cards appear as a single interface to the virtual machines on a server. If one physical NIC fails, traffic still passes to the virtual machines. NIC teaming extends HA right down to the server level.
Network issues are comparatively easy to fix, but storage issues are not, which makes a fully redundant storage design imperative for critical applications. And don't just point to your use of RAID and call it a day. RAID is great for data protection, but it won't cut it as a comprehensive clustered storage strategy.
The process and ultimate cost to implement storage HA depend greatly on the storage transport you select, the capabilities of your storage platform, your I/O requirements, and to some degree the capabilities of your network.
Fibre Channel is a popular storage network technology and many companies have invested in it. If this includes you, at the very least you should add redundant host bus adapters (HBAs) to critical servers, and those servers should be collapsing back to redundant Fibre Channel switches.
However, Fibre Channel is expensive, and it's often overkill for many small and midsize companies. A sound alternative is iSCSI, which is widely deployed and very stable when properly implemented.
From a performance perspective, logic would indicate that 4-Gbps Fibre Channel is four times as fast as 1-Gbps iSCSI. However, according to benchmarks run by Enterprise Strategy Group, a software-based 1-Gbps iSCSI target can pump through approximately 80% to 85% of the throughput and input/output operations per second of a 4-Gbps Fibre Channel HBA. Given that many enterprise-class servers have four 1-Gbps NICs on board, you essentially get up to 85% of Fibre Channel's performance for free, because any Gigabit Ethernet NIC can be turned into an iSCSI adapter.
As a result, budget-strapped IT managers who need to build solid storage HA can do so much more cheaply with iSCSI. You don't necessarily need a name-brand SAN, either. Smaller shops can easily build a starter iSCSI SAN using a server with direct-attached storage: All you need is software to spin that DAS storage into iSCSI. FreeNAS is one open source option you can use to turn DAS into an iSCSI target, and there are others.
Another option is Fibre Channel over Ethernet (FCoE), which lets IT shops use their existing Ethernet network for storage transport, instead of having to build out a Fibre Channel storage network with its specialized (and expensive) switches, HBAs, and cabling. The increasing penetration of 10-Gbps Ethernet makes FCoE a particularly viable alternative, and having a converged data and storage network also reduces management complexity and costs.
Regardless of your storage network, consider storage replication between multiple storage chassis. Discussions of storage replication inevitably lead to talk of disaster recovery and business continuity, which are beyond the scope of this article. That said, it's essential that you calculate the costs of downtime of business-critical applications. If being down will cost you hundreds of thousands of dollars in a given time period, you should also extend your availability and replication capabilities to a disaster recovery site that has full Layer 2 connectivity. If your business is in an urban area, there's a good chance a colocation facility is close by, and you can lease fiber from a provider that will give you sufficient bandwidth for your disaster recovery needs.
In addition to building availability into your network and storage systems, consider your application infrastructure. The most common method for increasing application availability is to build a cluster. A cluster includes hardware and software components. On the hardware side, you can add multiple servers, so if a component in one server goes down, another server in the cluster takes up the slack. Virtualization will also come into play in your server cluster: Running an application inside a VM can make it easier to restore the operating system and application if necessary. You can also take advantage of live migration, which can move VMs to different physical servers. This feature comes in handy if you detect conditions that indicate a server is about to go down.
The hard part of building a cluster often comes in integrating all the software components that make up a complex, multitiered application. Broadly speaking, for an application to be "cluster aware" it must be able to synchronize state information between nodes, so existing sessions can dynamically fail over to another node in the cluster.
Consider our earlier messaging example. Windows Server supports a few HA scenarios for building a fully clustered Exchange environment through the use of failover clustering services. In Windows Server 2008, failover clustering can simply be added as a server feature (not a role). Failover clustering works by listening for heartbeats (effectively, a ping response) from all the nodes in the cluster over a dedicated Layer 2 link. If the heartbeat is no longer detected on the primary node, all Outlook client connections are transferred over to a passive node.
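The heartbeat logic described above can be modeled in a few lines of Python. This is a generic sketch of the idea, not how Windows failover clustering is implemented: the class, the node names, and the five-second timeout are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class HeartbeatMonitor:
    timeout_seconds: float                       # silence longer than this means "down"
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat arrived from node at time now."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout_seconds


def choose_active(monitor: HeartbeatMonitor, preferred_order: Sequence[str], now: float) -> Optional[str]:
    """Return the first node in preference order whose heartbeat is still fresh."""
    for node in preferred_order:
        if monitor.is_alive(node, now):
            return node
    return None


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record("primary", now=0.0)
monitor.record("passive", now=0.0)
active_early = choose_active(monitor, ["primary", "passive"], now=4.0)

# The primary goes silent; only the passive node keeps sending heartbeats.
monitor.record("passive", now=8.0)
active_late = choose_active(monitor, ["primary", "passive"], now=10.0)
print(f"t=4s: {active_early}, t=10s: {active_late}")
```

The timeout is the key tuning knob in any design like this: too short and a brief network hiccup triggers a needless failover, too long and clients wait on a dead node.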
Clustering a multitiered application is a whole different ball of wax. Most scalable applications have a Web access layer (such as IIS, Apache, or Tomcat), a middleware layer to process core business logic (WebSphere Application Server or the like), and a back-end database. The challenge now becomes making each layer cluster-aware.
The Web Challenge
The Web access layer presents unique challenges. For example, say you're running a Web application using IIS. From a clustering perspective, when should a Web application fail over to a passive node? When the IIS service fails completely? When a TCP socket can no longer be established on the Web server? When you start detecting HTTP 500 (Internal Server Error) or HTTP 503 (Service Unavailable) errors?
Unfortunately, using Microsoft failover clustering for IIS isn't particularly effective for providing HA for your Web app. The recommended fallback for creating some fault resiliency in IIS is to use the Microsoft Network Load Balancing feature. However, NLB does not synchronize session state data, so if a Web server is lost at any point in time, all the session state data on the failed node is lost.
Therefore, it's common to use a dedicated load balancer (hardware or software) in front of your Web server farm. The load balancer makes dynamic decisions about the health and capacity levels of the Web servers it sits in front of. F5 is the market leader in load balancing and application delivery controllers (ADCs). Cisco Systems and Citrix are also major players in this market.
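A load balancer's health decision can be sketched as a simple rule over probe results, using the failure signals raised earlier: no TCP connection, no HTTP response, or a 5xx status such as 500 or 503. The Python below is an illustrative model only. The server names and the `ProbeResult` type are hypothetical, and commercial ADCs apply far richer checks (response time, content matching, connection counts).

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List, Optional


@dataclass
class ProbeResult:
    connected: bool               # could a TCP socket be established?
    status: Optional[int] = None  # HTTP status code, or None if no response


def is_healthy(probe: ProbeResult) -> bool:
    """Treat a server as healthy only if it answered with a non-5xx status."""
    if not probe.connected or probe.status is None:
        return False
    return probe.status < 500


def healthy_pool(probes: Dict[str, ProbeResult]) -> List[str]:
    """Names of the servers that should keep receiving traffic."""
    return sorted(name for name, probe in probes.items() if is_healthy(probe))


# Hypothetical probe results for a four-node Web farm.
probes = {
    "web1": ProbeResult(connected=True, status=200),
    "web2": ProbeResult(connected=True, status=503),   # Service Unavailable
    "web3": ProbeResult(connected=False),              # socket refused
    "web4": ProbeResult(connected=True, status=302),
}

pool = healthy_pool(probes)
rotation = cycle(pool)                       # simple round-robin over healthy nodes
first_three = [next(rotation) for _ in range(3)]
print(pool, first_three)
```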
An ADC will serve you well at the Web access layer, but you also need to cluster the middleware layer. It doesn't do you any good to have an ADC in front of multiple Web servers if they are all collapsing back to a single application server. Major application servers, such as IBM WebSphere, Oracle Application Server, and Sun App Server, include internal clustering capabilities.
At the database level, if you're using Microsoft SQL Server, you can use Microsoft Cluster Service to build a multinode shared storage SQL cluster. MSCS (known as failover clustering in Windows Server 2008 and later) provides failover capabilities built directly into the Windows operating system.
In Oracle land, Oracle Fail Safe can be used in conjunction with MSCS as a software-based HA tool for Oracle databases running on Windows boxes.
Another option is to use a third-party proxy to add load balancing and failover capabilities for your databases. For example, Citrix's NetScaler DataStream can proxy SQL transactions between the client and server and route database I/O in a similar fashion to the way Web server traffic is distributed among nodes in a cluster.
A benefit of using a proxy is that read and write SQL statements can be intelligently issued to the back-end database to increase efficiency, which gives you an immediate scalability boost to your database servers.
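One common way a proxy achieves this is read/write splitting: writes go to the primary database, while reads are spread across replicas. The Python sketch below illustrates that general technique and is not a description of how NetScaler DataStream works internally. The server names are hypothetical, and the keyword check is deliberately naive; it ignores transactions, locking reads, and replication lag.

```python
from typing import List, Sequence


def is_read_statement(sql: str) -> bool:
    """Naive classification: anything that starts with SELECT is a read."""
    words = sql.lstrip().split(None, 1)
    return bool(words) and words[0].upper() == "SELECT"


class SqlRouter:
    """Send writes to the primary and rotate reads across replicas."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self.primary = primary
        self.replicas: List[str] = list(replicas)
        self._next = 0

    def route(self, sql: str) -> str:
        if not self.replicas or not is_read_statement(sql):
            return self.primary
        target = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return target


# Hypothetical back-end servers.
router = SqlRouter("db-primary", ["db-replica1", "db-replica2"])
targets = [
    router.route("SELECT * FROM orders"),
    router.route("  select count(*) from orders"),
    router.route("UPDATE orders SET status = 'shipped' WHERE id = 7"),
    router.route("SELECT 1"),
]
print(targets)
```

Because most business applications read far more often than they write, moving reads off the primary is where the scalability gain comes from.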
In today's always-on businesses, HA isn't a luxury. The failure of critical IT systems can have direct costs to the business, from reduced user productivity to loss of customer revenue and goodwill. Therefore it's incumbent on both IT and business leaders at small and midsize enterprises to make HA a priority for core applications.
By targeting key network, storage, and application layers, and by taking advantage of both built-in device capabilities and third-party products, IT can make the company more resilient to the inevitable failures that occur in any technology system.
Randy George is a network engineer and InformationWeek Analytics contributor.