Overview of disaster recovery solutions for VMware vSphere
April 29, 2012
One of the main drivers for many organizations to start using server virtualization is the ability to simplify and perform disaster recovery. Server virtualization allows servers to be recovered on different server hardware, while testing and actual recovery procedures can be highly automated. Sys-con media recently published an interesting article on DR titled Virtualization Simplifies Disaster Recovery.
Still, for a lot of organizations disaster recovery (DR) is the ugly duckling that never gets addressed until it’s too late. DR is often considered too expensive and the risk of a total disaster is seen as low, while solving ad hoc, high-priority issues in the infrastructure consumes all the available resources of the IT staff, leaving no time for enhancements like solid DR software and procedures. It is mostly organizations that are required by law to have proper disaster protection (think of banks and government) and large companies that have well thought-out protection against disasters.
Vendors like VMware, Zerto and VirtualSharp have been offering DR software for some time which allows efficient, easy-to-use and cost-effective disaster recovery and testing. These software solutions are increasingly adding features which enable recovery to a secondary datacenter operated by a service provider. Disaster Recovery as a Service (DRaaS) is seen as a stepping stone for organizations to start using cloud computing. It makes DR a lot more cost effective than buying and maintaining your own secondary datacenter, hardware and software.
This posting gives an overview, including features, of three major disaster recovery solutions for VMware vSphere infrastructures, targeted at SMBs, enterprises and cloud providers. The links will bring you to my earlier postings about VMware Site Recovery Manager, Zerto Virtual Replication and VirtualSharp ReliableDR.
What is disaster recovery (DR)?
DR is a set of techniques and procedures to recover an IT infrastructure in an alternate location when the primary location is not available, or not preferred because a disaster made it (partly) unavailable. A disaster could be caused by an earthquake or fire, but more likely by human error or a major hardware or software failure.
In most cases protection against disasters requires data to be replicated to another (offsite) location. Replication means transferring the data in the same format as the original data. Backup and restore is in most cases not sufficient for DR, as a restore means converting data from the backup repository format back to a virtual machine format. This takes a lot of time and probably also means the data is not very current.
Replication of data can be done at different levels:
-at the storage level
-at the hypervisor level
-at the application level. Examples are Active Directory replication, Exchange Database Availability Groups, SQL Database Mirroring.
-using third party tooling, such as Novell PlateSpin Protect
What is needed for DR?
Making sure the data is replicated to another site is one thing. More important is to verify that it can be recovered to a fully functional virtual infrastructure in an acceptable time and with an acceptable loss of data, or none at all. Basically three approaches are possible:
1. fully manual recovery.
Data is replicated at the storage level. In case of a disaster IT staff have to reconnect the replicated volumes to ESX hosts in the recovery site, then register the virtual machine .vmx files in vCenter and manually start the virtual machines in the correct order. This obviously takes a lot of time, is error prone and is time consuming to test on a regular basis.
2. semi automated recovery.
Often VMware HA is seen as a way to perform disaster recovery. Using a stretched VMware cluster, synchronous replication and a storage cluster spanning two sites, VMs can be restarted on the remaining site if one site fails. However, VMware HA is not designed to recover from a failure of many hosts or a whole datacenter; its primary design purpose is to recover from a single host failure. Restart sequencing is not very granular. Using scripts, organizations might build their own orchestration, but this will need maintenance. Also, semi-automated recovery does not offer automated testing of DR procedures.
3. fully automated recovery (with or without DR Assurance).
In this scenario software tooling is used to fully orchestrate the steps needed to attach storage to ESX hosts, register VMs in vCenter and boot virtual machines using a runbook to make sure VMs are started in the designated order. Tooling even allows verification of the time needed to recover (RTO) and the maximum allowed loss of data (RPO).
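The "designated order" of a runbook can be thought of as a dependency graph between virtual machines. The following is only a minimal sketch of that idea in Python, not how any of the products discussed here implement it. The VM names and the dependency map are made up for illustration, and the actual power-on calls to vCenter are left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each VM lists the VMs that must be
# running before it may be started.
DEPENDENCIES = {
    "domain-controller": [],
    "sql-server": ["domain-controller"],
    "exchange": ["domain-controller"],
    "web-frontend": ["sql-server"],
}


def boot_waves(dependencies):
    """Group VMs into boot waves.

    Every VM in a wave depends only on VMs from earlier waves, so all
    VMs within one wave could be powered on in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave booted successfully
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(boot_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With the example map the domain controller is started first, then Exchange and SQL Server together, and finally the web front end. A real runbook would also wait for each wave to become reachable before starting the next one.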
The rest of this posting will focus on the three most used software tools for automated DR and DR Assurance.
VMware Site Recovery Manager (SRM) 5
The current release of VMware SRM is 5.0. SRM has been available since version 1.0 was released in the summer of 2008. It is by far the market leader for DR software in VMware environments. It is also the most versatile solution available: it has a lot of features and is able to replicate using the storage layer as well as hypervisor-based replication (vSphere Replication). It can be used in the SMB space as well as in enterprise environments. Hypervisor-based replication was added in the 5.0 release and this feature allows for storage-agnostic replication. This means storage in the DR site can be of another vendor and brand than storage in the primary site. vSphere Replication does have some limitations. While storage-based replication allows automated failback, vSphere Replication does not. Also, when a protected VM has moved to another host because of VMware HA, the virtual disks need to be re-synced. Host-based replication still needs to mature and its main focus is the SMB market at the moment. Strangely enough, VMware decided not to support SRM for the vSphere Essentials and Essentials Plus editions, which are targeted at the SMB market.
SRM 5 is available as a Standard Edition and an Enterprise Edition. Both editions have exactly the same features; Standard Edition, however, is limited to protection of 75 virtual machines. Standard Edition is attractively priced with a list price of around €160 / $213 per virtual machine, available in 25-packs.
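To get a feel for what the 25-pack licensing means for a given number of protected VMs, here is a rough calculation in Python. It uses only the approximate list price and limits mentioned above and assumes that licenses can only be bought in whole packs. Actual pricing, discounts and support costs will differ, so treat it as back-of-the-envelope arithmetic.

```python
import math

PACK_SIZE = 25           # licenses are sold in 25-packs (from the text)
STANDARD_VM_LIMIT = 75   # Standard Edition protects at most 75 VMs
PRICE_PER_VM_USD = 213   # approximate list price quoted above


def standard_edition_cost(protected_vms):
    """Return (packs needed, approximate list price in USD).

    Assumption: only whole 25-packs can be purchased, so the VM count
    is rounded up to the next multiple of 25.
    """
    if protected_vms < 1:
        raise ValueError("at least one VM must be protected")
    if protected_vms > STANDARD_VM_LIMIT:
        raise ValueError("Standard Edition is limited to 75 protected VMs")
    packs = math.ceil(protected_vms / PACK_SIZE)
    return packs, packs * PACK_SIZE * PRICE_PER_VM_USD


if __name__ == "__main__":
    packs, cost = standard_edition_cost(40)
    print(f"40 VMs: {packs} packs, roughly ${cost}")
```

Protecting 40 VMs would, under these assumptions, require two packs (50 licenses) at roughly $10,650 list price.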
Zerto Virtual Replication 1.0
Zerto released its Virtual Replication 1.0 version in April 2011. It offers enterprise near-synchronous replication at the hypervisor level. The solution does not support storage-based replication but is able to deliver near-synchronous replication with an RPO of as little as a few seconds. Since its release Zerto has received a lot of attention and some awards. Its unique selling point is its ability to protect any virtual machine running on any kind of VMware-supported storage without the restrictions of vSphere Replication in SRM 5.0. Zerto has mature features for managing replication, such as compression and the ability to prioritize replication of important virtual machines. Zerto is adding new features in updates to release 1.0, which are made available every couple of months. Its runbook features have been improved over the last few months.
VirtualSharp ReliableDR 2.6
VirtualSharp Software is a relatively unknown player in the DR field. They have been around for some time, offering DR Assurance to a select number of customers, mainly banks and insurance companies. Recently they expanded their focus from Spain to the rest of the world. The current customer base is still small compared to SRM, but the solution is getting more attention from customers and the industry.
Gartner has selected VirtualSharp Software as a Cool Vendor in Gartner’s “Cool Vendors in Business Continuity Management and IT Disaster Recovery Management, 2012” report.
ReliableDR is able to replicate both at the storage layer and at the hypervisor level. Its unique selling point is DR Assurance. It does not only verify at an infrastructure level (does your VM boot and can it be reached over the network) but, more importantly, ReliableDR automatically verifies your applications against RTO and RPO. It does so by booting up your VMs in an isolated network and timing when all VMs *and* applications are up and running (RTO), and by checking whether the data is current according to the RPO. This is done by sending queries to databases, Exchange or web servers and verifying whether an expected response (i.e. text) is returned. It is very easy to install as it only needs a single software module in the DR site and only one instance of vCenter Server.
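The idea of checking a DR test against RTO and RPO can be illustrated with a few lines of Python. This is a simplified sketch and not ReliableDR's implementation. The class, the field names and the example timestamps are hypothetical, and it assumes the test tooling has already recorded when the test started, when the last application check succeeded and when the most recent replicated write was made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    """Timestamps collected during one (hypothetical) DR test run."""
    test_started: datetime            # moment the failover test began
    services_ready: datetime          # last application check succeeded
    last_replicated_write: datetime   # newest data present at the DR site


def check_objectives(result, rto, rpo):
    """Compare measured recovery time and data age with RTO and RPO."""
    recovery_time = result.services_ready - result.test_started
    data_age = result.test_started - result.last_replicated_write
    return {
        "recovery_time": recovery_time,
        "data_age": data_age,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_age <= rpo,
    }


if __name__ == "__main__":
    run = DrTestResult(
        test_started=datetime(2012, 4, 29, 2, 0),
        services_ready=datetime(2012, 4, 29, 2, 45),
        last_replicated_write=datetime(2012, 4, 29, 1, 58),
    )
    report = check_objectives(run, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
    print(report)
```

In this made-up run the applications were back after 45 minutes with data that was 2 minutes old, so an RTO of one hour and an RPO of five minutes would both be met.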
Cost-effective protection against disasters can be achieved by outsourcing the recovery infrastructure to a service provider. Disaster Recovery as a Service (DRaaS) has received a lot of attention. The service for virtual machines started about one year ago, when a couple of US-based service providers started to offer SRM protection to their customers. All three vendors now offer features which enable DRaaS. Features like rental of licenses and support for vCloud Director are available or are on the roadmap for the next release.
Which solution best fits the needs of the organization depends on budgets, requirements, size of the infrastructure etc. All three solutions have their own strong, unique features. The image below shows the features of the three mentioned solutions. I contacted all three vendors and supplied them with a list of features, which they filled in. All vendors also added some additional features.
All three offer a free trial download which can be used to test the solution.