Disaster recovery has been a challenge for IT organizations ever since the core business realized it couldn't permanently lose mission critical data and expect to remain in business.
For several years now, instead of debating whether they could afford a disaster recovery system, IT staffs have been debating how a system should be set up to meet their mean time to recovery goals. That is, how long can the business afford to have data and systems temporarily unavailable?
Now a new school of thought holds that planning and implementing disaster recovery has been the wrong idea all along, because disaster recovery tolerates business outages and sometimes small losses of data. What's needed isn't a defined amount of time during which the enterprise's key data and systems are unavailable, but zero tolerance for their being unavailable at all. Existing disaster recovery thinking doesn't include that goal because it's deemed too expensive.
But now a young outfit called ScaleArc argues that virtualization, the cloud and new database management software mean it's time to implement active/active systems. What used to be a "recovery system" is designed under active/active to run in lockstep with the production system.
If the production systems go down, there is no mean time to recovery. The backup system takes over nearly instantaneously as active data is fed into it and it becomes recognized as the primary system. Customers experience a slight pause, or better yet, notice nothing amiss, and no business transactions are interrupted or customer data lost. In this view, disaster recovery has become passé and zero downtime the new normal.
In a white paper, Architecting the Active/Active Data Center, produced by ScaleArc, a Santa Clara, Calif., database load balancing software supplier, the writers claim the average cost of an hour of downtime is $163,674.
Furthermore, the U.S. Small Business Administration and U.S. Bureau of Labor say that 43% of businesses never reopen after a disaster; another 29% close within two years. Ninety-three percent of companies that suffer a major data loss are out of business within five years, according to this source.
The leading-edge disaster recovery thinking concludes that mission-critical systems, particularly transactional databases, need to be running on a standby as well as a production basis. A standby system is an identical copy of the production system, carefully kept up to date and current on all software updates. But it is, nevertheless, a system that sits idle until a disaster hits. In many cases it's a virtual system on a server already doing other work, but in a different location than the primary data center. A standby system is not, however, a mirrored copy with all the live data feeds of the primary system. Those feeds must be activated, even if the standby system is awake and running on the server or server cluster.
The ScaleArc white paper notes: "These architectures can take hours to restore operations and recover essential data following a failure. Within a single data center, line of business application recovery can take up to 30 minutes, even when adequately set up for DR." Add more time if the recovery must occur across multiple data centers and services.
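To put those recovery times in money terms, here is a minimal Python sketch that multiplies the white paper's average hourly downtime figure by the length of an outage. The 30-minute case comes from the white paper; the three-hour case is only an illustrative stand-in for "hours," and the function name is hypothetical.

```python
# Average cost of one hour of downtime, as claimed in the ScaleArc white paper.
HOURLY_DOWNTIME_COST = 163_674  # USD


def downtime_cost(minutes: float, hourly_cost: float = HOURLY_DOWNTIME_COST) -> float:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    return hourly_cost * minutes / 60


# Recovery within a single data center: up to 30 minutes per the white paper.
print(downtime_cost(30))   # 81837.0

# Assumed three-hour recovery across data centers (illustrative only).
print(downtime_cost(180))  # 491022.0
```

By this rough measure, even a well-prepared single-site recovery costs about $82,000 per incident at the white paper's average rate.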
The argument against moving to an active/active architecture has been cost. "For most organizations, the operational cost has been too high to justify – such a design requires sophisticated application, database and networking configuration," the white paper continued. Few organizations have implemented such an architecture, it adds.
But properly configured, active/active allows an instantaneous failover of live data feeds into database systems that are up and running and ready to receive them. ScaleArc provides database load balancing, a background requirement for the sort of active/active systems that it's recommending, so it is predictably a supporter of the active/active concept. At the same time, it acknowledges that active/active represents a more expensive architecture and puts its own estimate on how much more expensive: you're likely to pay a 20% premium on your system costs to get there, according to ScaleArc representatives who contacted InformationWeek.
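One way to weigh that 20% premium is to ask how many hours of avoided downtime would pay for it. The Python sketch below does that arithmetic using the white paper's hourly downtime figure. The $1 million system cost is a made-up input for illustration, not a number from ScaleArc, and the function name is hypothetical.

```python
HOURLY_DOWNTIME_COST = 163_674   # USD per hour, from the ScaleArc white paper
ACTIVE_ACTIVE_PREMIUM = 0.20     # ScaleArc's estimated premium on system costs


def break_even_downtime_hours(
    system_cost: float,
    premium: float = ACTIVE_ACTIVE_PREMIUM,
    hourly_cost: float = HOURLY_DOWNTIME_COST,
) -> float:
    """Hours of avoided downtime needed to offset the active/active premium."""
    return system_cost * premium / hourly_cost


# Hypothetical system costing $1 million: the premium is $200,000.
hours = break_even_downtime_hours(1_000_000)
print(f"{hours:.2f} hours")  # 1.22 hours
```

Under those assumptions, the premium is recovered once active/active has prevented a little over an hour of outage. A business with a lower cost of downtime or a more expensive system would need to avoid proportionally more.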
The gain with active/active mode is that a production application can be running in more than one data center, including a cloud data center, and be capable of serving application traffic at the same time as any other copy of the system. The economics of disaster recovery improve when two or more systems are active, sharing the load and providing scalable capacity. Nothing sits idle until the disaster happens. But there are a number of operational challenges to active/active.
It's not enough to replicate a database system to a second center. The production application must be aware of multiple database systems and know where they are. Without that awareness, the application can't know whether to route traffic to a local data center or remote one for reads and writes. Reads need to go to the nearest system, writes to the designated primary system, regardless of proximity.
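The routing rule described above can be sketched in a few lines of Python. This is an illustration only, not ScaleArc's implementation: the `DatabaseNode` class, the `route` function and the server names are hypothetical, and measured latency is used here as a stand-in for "nearest."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseNode:
    name: str
    latency_ms: float          # round-trip time seen by this application instance
    is_primary: bool = False   # True for the one node designated to accept writes


def route(operation: str, nodes: list[DatabaseNode]) -> DatabaseNode:
    """Send writes to the designated primary and reads to the nearest node."""
    if not nodes:
        raise ValueError("no database nodes configured")
    if operation == "write":
        primaries = [n for n in nodes if n.is_primary]
        if len(primaries) != 1:
            raise ValueError("expected exactly one designated primary")
        return primaries[0]    # proximity is ignored for writes
    if operation == "read":
        return min(nodes, key=lambda n: n.latency_ms)
    raise ValueError(f"unknown operation: {operation!r}")


nodes = [
    DatabaseNode("local-replica", latency_ms=2.0),
    DatabaseNode("remote-primary", latency_ms=70.0, is_primary=True),
]
print(route("read", nodes).name)   # local-replica
print(route("write", nodes).name)  # remote-primary
```

Even this toy version shows why the awareness is costly to maintain: the list of nodes, their roles and their distances all have to be kept current wherever the routing logic lives.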
Building such awareness into the application is expensive and requires revisions if an additional site becomes available, the white paper points out. Calls to the database may come from both synchronous corporate users and asynchronous third parties, some of them supporting mobile user devices. Supporting locality for third parties becomes increasingly complex.
One alternative is to rely on the database system itself, combined with database cluster software that assigns master responsibilities. This approach creates new issues, such as maintaining the data consistency of the master system, and will tend to require writes to all other active systems. Both issues may impact the user experience, the white paper points out.
Active/active capability with Microsoft's SQL Server "has also proven very challenging. AlwaysOn Availability Groups in SQL Server 2012/2014 don't support location or replication awareness," the white paper said. That means the production application is offered a random database server for its connection, which may not be the closest server when the nearest server is needed to avoid latencies. Data replication across sites would also need to be monitored to ensure that the randomly selected server had current data. Such a scenario can produce data consistency or performance issues or both, and such issues can be "difficult to diagnose and resolve," the white paper said. A read task would go to the first available read replica without being load balanced, meaning that task could tax one server while other read replicas sit idle.
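For contrast with random or first-available selection, the following Python sketch shows what a replication-aware, load-balanced choice of read replica might look like. It is a generic illustration, not SQL Server's or ScaleArc's behavior: the `ReadReplica` class, the five-second lag threshold and the least-connections rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadReplica:
    name: str
    lag_seconds: float        # how far the replica trails the primary
    active_connections: int   # current load on the replica
    latency_ms: float         # round-trip time from the application


def pick_read_replica(
    replicas: list[ReadReplica], max_lag_seconds: float = 5.0
) -> ReadReplica:
    """Skip stale replicas, then prefer the least loaded, then the nearest."""
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        raise LookupError("no replica is within the allowed replication lag")
    return min(fresh, key=lambda r: (r.active_connections, r.latency_ms))


replicas = [
    ReadReplica("local-a", lag_seconds=0.4, active_connections=42, latency_ms=2.0),
    ReadReplica("local-b", lag_seconds=0.6, active_connections=7, latency_ms=2.5),
    ReadReplica("remote-c", lag_seconds=30.0, active_connections=0, latency_ms=80.0),
]
print(pick_read_replica(replicas).name)  # local-b
```

Here the idle remote replica is rejected because its data is 30 seconds behind, and the read goes to the less busy of the two local replicas instead of piling onto the first one available.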
The white paper cited the database consultancy, Pythian, as recommending the creation of a software layer that allows an IT staff to abstract away the database and application variations and adopt database load balancing to cope with replication, concurrency and locality issues. Such a step avoids having to engineer all that awareness into the application, the white paper said.
Several sources of database load balancing are available in addition to ScaleArc, such as F5 Networks for Oracle, HAProxy for MySQL, and Stack Overflow for multiple systems. Regardless of how an IT staff addresses the problem, an active/active system running in a primary data center and in a virtual machine at a secondary site can provide quick recovery service without needing to duplicate all hardware. Active/active can replace disaster recovery, eliminating the mean time to recovery of 30 minutes up to several hours. It can minimize or eliminate data loss as one system fails and another takes over.
These potential savings to the business warrant a review of what constitutes a disaster recovery system. Is active/active still too expensive, or can the business no longer afford to live without it? When the cost of a possible business outage is included in the outlook, the active/active approach of using virtual database servers in a secondary data center or in the cloud can lead IT managers in the direction of more resilient, survivable systems.