Overview of Business Continuity/Disaster Recovery for virtual infrastructures

Business Continuity/Disaster Recovery (BC/DR) is getting more attention from IT management now that more and more solutions make cost-effective DR possible. Cloud computing is one of the enablers of cost-effective DR. This article gives a high-level overview of methods to protect virtual infrastructures (both VMware and Hyper-V) and mentions some solutions that protect the infrastructure when a disaster happens.

Any decision on which method to use for protecting the datacenter can only be made once the RTO and RPO of the applications are known. The RTO, or Recovery Time Objective, is the maximum amount of time allowed to recover the application after a disaster and get it operational again. The RPO, or Recovery Point Objective, is the maximum amount of data, expressed in time, that is acceptable to lose in a disaster. Obviously an RTO and RPO of at most 5 minutes will need more expensive components than an RTO and RPO of 48 hours.
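To make the two metrics a bit more concrete, here is a minimal Python sketch. It assumes a simplified model in which a copy is taken at a fixed interval and needs some transfer time before it is usable in the recovery site. The function names and the model itself are my own assumptions for illustration and are not taken from any product.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Worst case in this simplified model: the disaster strikes just before
    the next copy has finished arriving, so everything written since the
    previous copy was taken is lost."""
    return replication_interval + transfer_time


def meets_rpo(replication_interval: timedelta,
              transfer_time: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(replication_interval, transfer_time) <= rpo


def meets_rto(recovery_steps: list, rto: timedelta) -> bool:
    """True if the summed duration of all recovery steps stays within the RTO."""
    return sum(recovery_steps, timedelta()) <= rto


if __name__ == "__main__":
    # Replicate every 15 minutes, 5 minutes to ship each copy, RPO of 30 minutes.
    print(meets_rpo(timedelta(minutes=15), timedelta(minutes=5), timedelta(minutes=30)))
    # A nightly copy can never satisfy an RPO of 5 minutes.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=2), timedelta(minutes=5)))
```

The point of the sketch is only that the replication interval plus the time to ship a copy bounds the data you can lose, while the sum of all recovery steps has to fit inside the RTO.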

Fundamental to DR is that data is copied to another physical location. Keeping a backup in the same datacenter but in another room is an option, but not a wise one, as the whole datacenter can be destroyed (fire, earthquake, plane crash, bomb, etc.).

I will focus on data replication to another disk system. This is by far the fastest and most reliable option to protect the datacenter and its applications and data. Tape is an alternative, but it is slow to recover data from and difficult to test on a regular basis, let alone to test automatically the way disk-based replication can be.

There are two ways to make use of resources used for DR:

  • do it yourself. Own a second datacenter, or rent datacenter capacity (rack space, power, cooling). Server hardware and storage are all owned by your organisation.
  • make use of cloud computing. Send data to a cloud provider, which is responsible for having the right storage and compute resources available when a DR is tested or performed.

There are two ways to protect the primary datacenter:

  • using a cold backup. Data is copied to an alternate location. Virtual machines cannot be started in this alternate location. In case of data loss in the primary datacenter, the data needs to be copied back from the alternate to the primary datacenter.
  • using a warm or hot backup. Data, including virtual machine data, is copied to an alternate location. Virtual machines and applications can be made operational in the alternate datacenter. A warm backup means a recovery time of hours to days; a hot backup means a recovery time of an hour or less.
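The distinction between cold, warm and hot can be written down as a small decision rule. The Python sketch below is only an illustration: the function name is hypothetical, and the only threshold it uses is the one-hour boundary mentioned above.

```python
from datetime import timedelta


def classify_backup(vms_can_start_at_alternate_site: bool,
                    expected_recovery_time: timedelta) -> str:
    """Classify a protection setup as 'cold', 'warm' or 'hot'."""
    if not vms_can_start_at_alternate_site:
        # Data only: it has to be copied back before anything can run again.
        return "cold"
    if expected_recovery_time <= timedelta(hours=1):
        return "hot"
    return "warm"


if __name__ == "__main__":
    print(classify_backup(False, timedelta(days=2)))
    print(classify_backup(True, timedelta(hours=8)))
    print(classify_backup(True, timedelta(minutes=15)))
```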

Replication of data can be performed using various methods:

  • array based replication (hardware or software initiated)
  • guest/OS based replication
  • hypervisor based replication / host based replication
  • application level replication
  • replication of backup data

Compression and data deduplication
Data sent over WAN connections can be compressed, and quality of service can be set so that replication does not interfere with the network traffic of business applications.

HyperIP by NetEx is an example of a WAN optimization solution. It is often used in combination with Veeam Backup & Replication.

Data deduplication technology identifies and eliminates redundant data segments so that backups consume significantly less storage capacity. It lets organizations hold onto months of backup data to ensure rapid restores (better recovery time objective [RTO]) and lets them back up more frequently to create more recovery points (better recovery point objective [RPO]). Companies also save money by using less disk capacity and by optimizing network bandwidth.
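To illustrate the idea, here is a minimal Python sketch of deduplication using fixed-size chunks and SHA-256 fingerprints. This is a simplified illustration and not how any particular product works; real solutions often use variable-size chunking. The chunk size and all names are assumptions of mine.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, for illustration only


def deduplicate(data: bytes, store: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and keep each unique chunk only once.

    Returns the 'recipe': the ordered list of chunk fingerprints needed to
    rebuild the original data from the store.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        store.setdefault(fingerprint, chunk)  # only stored if not seen before
        recipe.append(fingerprint)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    store = {}
    monday = b"A" * 8192 + b"B" * 4096
    tuesday = b"A" * 8192 + b"C" * 4096  # only the last chunk changed
    monday_recipe = deduplicate(monday, store)
    tuesday_recipe = deduplicate(tuesday, store)
    assert restore(monday_recipe, store) == monday
    assert restore(tuesday_recipe, store) == tuesday
    print(len(store), "unique chunks stored for", len(monday_recipe) + len(tuesday_recipe), "chunks backed up")
```

In this example the two backups consist of six chunks in total, but only three unique chunks end up in the store. Only chunks that are not yet in the store would have to be sent over the WAN.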

There are many solutions on the market that offer deduplication compatible with VMware.

Features you might want to have for DR and testing of your DR procedures.

  • one-button automated disaster recovery. Pressing one button starts an automated and orchestrated failover of the primary datacenter. Virtual machines start in the recovery site according to a predefined run-book. High-priority VMs like AD and DNS start first, then databases, then application servers. All of this is performed without manual intervention.
  • one-button failback. When the primary datacenter has been restored, virtual machines running in the recovery site are moved back to the primary datacenter automatically.
  • automated, scheduled disaster recovery testing. To make sure the DR procedures actually result in operational virtual machines in the recovery site, you want to test this regularly. Production VMs are not affected by the test. Reporting on the status is a nice extra feature.
  • planned migration. Situations are possible in which a datacenter needs to be evacuated, but not instantly. Think of an expected hurricane, or planned downtime because of maintenance on a core switch or an important storage array. In that case you want a clean shutdown of virtual machines, which are then moved to the alternate datacenter.
  • application-consistent replication. When replication is performed, you want data (databases) to be consistent so no data is lost.
  • guarantee that RTO and RPO are met. Confirmation that a virtual machine is up and running does not mean the application and its services are in the state they are expected to be in. You might want to ensure that the services are recovered within the RTO and that data loss stays within the RPO.
  • grouping of VMs. You might want to apply the same policies, such as the replication schedule, to a group of protected VMs.
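The run-book ordering mentioned in the list above (AD and DNS first, then databases, then application servers) can be sketched in a few lines of Python. This is only an illustration of the ordering idea. The class, the tier numbers and the VM names are hypothetical and do not reflect the data model of any DR product.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ProtectedVM:
    name: str
    tier: int  # hypothetical priority tier: lower numbers start first


def boot_waves(vms: list) -> list:
    """Group VMs into start-up waves, lowest tier first.

    Every VM in a wave should be up before the next wave is started.
    """
    ordered = sorted(vms, key=lambda vm: (vm.tier, vm.name))
    return [[vm.name for vm in wave]
            for _, wave in groupby(ordered, key=lambda vm: vm.tier)]


if __name__ == "__main__":
    run_book = [
        ProtectedVM("app01", tier=3),
        ProtectedVM("dc01", tier=1),   # AD
        ProtectedVM("sql01", tier=2),  # database
        ProtectedVM("dns01", tier=1),
    ]
    for number, wave in enumerate(boot_waves(run_book), start=1):
        print("wave", number, wave)
```

A real orchestration tool would also wait for each wave to be healthy before starting the next one, which this sketch leaves out.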

Quite a few solutions are available to perform disaster recovery. This section will mention some. In future blog posts I will go into more detail about solutions and their features and will compare them.

Array-based replication (done by array)

Array-based replication is available as synchronous and asynchronous replication. An example of synchronous replication (no data loss) is the HP StorageWorks P4000 series. Using Network RAID 10, data is written redundantly to two nodes simultaneously. Each node can be located in a different datacenter. The connection between both datacenters must have a latency of at most around 2-3 ms. This is an expensive solution because only 50% of the raw storage capacity can be used for data storage, and because of the fiber connection. The disks in both sites also need to have the same specs (both SAS, for instance). A VMware metro cluster can be created. An advantage is that no manual work needs to be done to present LUNs to VMware hosts in case of a failover. Automation of DR in case of an unplanned failure is not possible. VMs will be restarted by VMware HA using the priority settings configured at the cluster level.
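The 50% figure follows directly from keeping two copies of every block. A small Python sketch of that arithmetic, together with a check against the latency limit mentioned above, could look like this. The function names and example numbers are my own assumptions, and the default limit of 3 ms is simply the upper end of the 2-3 ms range.

```python
def usable_capacity_tb(raw_tb_per_site: float, sites: int = 2, copies: int = 2) -> float:
    """Usable capacity when every block is stored `copies` times across the sites."""
    return raw_tb_per_site * sites / copies


def storage_efficiency(copies: int = 2) -> float:
    """Fraction of the raw capacity that is left for data."""
    return 1 / copies


def latency_within_limit(round_trip_ms: float, limit_ms: float = 3.0) -> bool:
    """Check a measured inter-site latency against the assumed limit."""
    return round_trip_ms <= limit_ms


if __name__ == "__main__":
    # Two sites with 10 TB raw each and two copies of every block.
    print(usable_capacity_tb(10.0), "TB usable out of", 10.0 * 2, "TB raw")
    print(storage_efficiency())
    print(latency_within_limit(2.5), latency_within_limit(8.0))
```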

Using asynchronous replication, volumes can be replicated to another location. If the primary site is down, the replicated volume in the recovery site can be made operational. Some data will be lost. Some manual work needs to be done to present LUNs to hosts and register virtual machines in vCenter Server or SCVMM/Hyper-V Manager. The restart has to be performed entirely manually.

To automate and orchestrate DR for Hyper-V environments, NetApp has created PowerShell scripts and uses System Center Orchestrator (formerly Opalis) to perform DR automation.

A disadvantage of array-based replication is that the storage array in the primary site needs to be from the same vendor as the storage array in the recovery site. Also, replication is configured per LUN, so administrators always need to make sure protected VMs are on the replicated LUNs.

Array-based replication (software initiated)

Besides the array replication described above, software solutions are available that use array-based replication but add value to it. For VMware environments, Site Recovery Manager is well known. Another solution is VirtualSharp ReliableDR. Both use the features of the storage layer to replicate data. On top of that they add features to perform DR testing, automate failover, configure priorities, reconfigure the IP configuration of VMs that are failed over, etc.

VirtualSharp goes a step further. It not only checks whether a VM has booted in the recovery site, it also checks whether an application or service (a group of VMs) is functional and whether RTO and RPO are met according to the SLA. See this post about what VirtualSharp ReliableDR has to offer for DR.

Guest/OS based replication
Replication is executed by an agent in the virtual machine or triggered by a backup server on a per-VM basis. Examples are Veeam Backup & Replication, Novell PlateSpin Forge and PlateSpin Protect, Veritas Volume Replicator and Double-Take Availability. Most of these solutions will do for recovery of a single VM or a limited number of VMs, but not for recovery of a large environment. Automated and orchestrated recovery is not possible.

Hypervisor replication and host based replication
These forms of replication are typical for virtual environments. Here an appliance is installed on each VMware host. Zerto Virtual Replication is a new product launched in August 2011. It clones the write I/O of virtual machines, compresses it and sends the data to a recovery site. It does not need array-based replication and is much easier to configure than array-based replication solutions.
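The general idea of cloning write I/O, compressing it and shipping it to a recovery site can be sketched in Python as follows. This is explicitly not Zerto's implementation, only a toy model of the technique: the class and function names are hypothetical, the "disk" is a byte array in memory and the "WAN" is an in-memory queue.

```python
import zlib
from collections import deque


class ReplicatedDisk:
    """Toy virtual disk that clones every write into an outbox for replication."""

    def __init__(self, size: int) -> None:
        self.local = bytearray(size)
        self.outbox = deque()  # compressed writes waiting to be shipped

    def write(self, offset: int, data: bytes) -> None:
        # The write is applied locally as usual ...
        self.local[offset:offset + len(data)] = data
        # ... and a compressed clone of it is queued for the recovery site.
        self.outbox.append((offset, zlib.compress(data)))


def ship_to_replica(source: ReplicatedDisk, replica: bytearray) -> int:
    """Apply all queued writes to the replica in their original order.

    Returns the number of compressed payload bytes that were shipped.
    """
    shipped = 0
    while source.outbox:
        offset, payload = source.outbox.popleft()
        data = zlib.decompress(payload)
        replica[offset:offset + len(data)] = data
        shipped += len(payload)
    return shipped


if __name__ == "__main__":
    disk = ReplicatedDisk(size=4096)
    replica = bytearray(4096)
    disk.write(0, b"boot sector")
    disk.write(1024, b"x" * 2048)  # highly compressible payload
    print("shipped", ship_to_replica(disk, replica), "compressed bytes")
    print("replica identical:", bytes(replica) == bytes(disk.local))
```

Applying the writes in their original order is what keeps the replica consistent with the source. Anything still sitting in the outbox when the primary site fails is the data that would be lost.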

Starting with version 5.0 of Site Recovery Manager (available in September 2011), replication at host level is also possible. In this scenario replication is not done by the storage array but by a software appliance. This new SRM 5.0 feature is called vSphere Replication. See this post about what is new in SRM 5.0. SRM has some drawbacks; for example, it cannot be used on datastores that are part of a Storage DRS datastore cluster.

VirtualSharp ReliableDR also has host based replication.   

Windows Server 8 with Hyper-V R3 will have a feature called Hyper-V Replica. Each VM can be configured to replicate its data to another Hyper-V server running in a recovery site.

Application level replication
Probably the easiest method of replication to install and manage. Services like Microsoft Active Directory and Exchange 2010 replicate their data to another instance themselves. However, most applications do not have replication features, so you will need one of the methods mentioned above.

Replication of backup data
Backup data of virtual machines can be replicated to a recovery site. Microsoft DPM can replicate its backup data. DPM is not VMware aware, so it will treat a VMware VM as a physical server. Veeam Backup & Replication backup files (VBK etc.) can be replicated using third-party solutions. A free solution that is often used is rsync. Keep in mind that Veeam by default changes the filename of the VBK after each run. To prevent this rename, so that rsync only replicates the changed blocks inside the files, set a registry entry as explained here.

Disaster Recovery to the cloud
Cloud computing has many advantages. It is cost-effective, scalable and has a pay-per-use model. Combined with solutions that offer host based replication, Disaster Recovery as a Service (DRaaS) is a very appealing option for your DR.

This posting titled How The Cloud Changes Disaster Recovery by Mike Klein explains conventional disaster recovery versus disaster recovery to the cloud. Good read. It shows the black arrow for conventional and the red arrow for DR in the cloud.

ZDnet.com has a good article titled Disaster Recovery is your transitional Cloud step

Virtualizationreview.com has a great webinar on the subject of DR for VMware sponsored by Virtacore, PHD Virtual, Zerto and Veeam

You can view the event at your convenience until October 19, 2011.

This event explored how to best implement disaster recovery in your VMware environment and virtual infrastructure. Get all the details.

See this link for more info.


About Marcel van den Berg
I am a technical consultant with a strong focus on server virtualization, desktop virtualization, cloud computing and business continuity/disaster recovery.
