What are the dangers of snapshots and how to avoid them?

VMware vSphere snapshots can be very useful. Just like a photo, a snapshot captures the state of a virtual machine at a certain point in time. This capture is read-only, so it cannot be modified while the virtual machine is active. Returning to a known good state is a matter of a few mouse clicks.

However, snapshots are not that innocent. You can shoot yourself in the foot if you are not aware of their side effects.

Backup software that relies on snapshots is a major cause of virtual machine performance and availability issues. See my previous post about the impact of snapshots.

It is very important to understand what the impact of snapshots can be on availability and performance of virtual machines:

  • a virtual machine with active snapshot(s) that performs many writes to disk can fill up a datastore, causing all VMs on that datastore to crash or pause
  • deleting a snapshot can pause a virtual machine for many minutes. This can, for example, result in an Exchange Server DAG cluster failover or other unwanted side effects.

This post will provide information on snapshot deletions (commit as well as consolidation) and how to prevent pausing of virtual machines. We will focus on VMware snapshots but much in this post applies to snapshots of other solutions as well.

Some advice:

  • check whether your application supports snapshots and under which conditions
  • a successful backup using snapshots does not automatically mean a successful restore!
  • snapshots are not a replacement for backups!
  • keep snapshots active for a couple of hours at most, then delete them
  • be very careful using snapshots on virtual machines which perform many write transactions to disk
  • have a close look at the impact and behaviour of your backup tool on snapshot files
  • Microsoft Exchange and SQL Server are examples: snapshots of virtual machines running Exchange are not supported, and snapshots of SQL Server are supported only when VSS is used
  • snapshots of virtual machines using in-guest iSCSI drives are not supported

 

Introduction to VMware snapshots

When a snapshot is made, the original VMDK (called the parent or base disk) is set to read-only mode. All further writes to the virtual machine disks are stored in a delta disk (also called a snapshot disk, child disk, virtual disk redo log or sparse delta disk). These delta disks have <number>-delta.vmdk at the end of the filename. Delta disks grow in chunks of 16 MB. Each time a chunk is added, the VMFS volume is locked.
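Because delta disks are recognisable by their -delta.vmdk suffix, it is easy to see how much space they take. The snippet below is a minimal sketch, not a VMware tool: it assumes the virtual machine's directory is readable as an ordinary path from wherever the script runs, and the function names are made up for this illustration.

```python
from pathlib import Path


def delta_disk_usage(vm_dir):
    """Return (file name, size in bytes) for every *-delta.vmdk file in vm_dir."""
    files = sorted(Path(vm_dir).glob("*-delta.vmdk"))
    return [(f.name, f.stat().st_size) for f in files]


def total_delta_mb(vm_dir):
    """Total size of all delta disks in vm_dir, in MB."""
    return sum(size for _, size in delta_disk_usage(vm_dir)) / (1024 * 1024)
```

Calling total_delta_mb() on a virtual machine's directory at regular intervals gives a rough idea of how fast its snapshots are growing.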

Multiple snapshots can be taken of the same virtual machine.

Snapshots are very useful for making sure a known working situation can be restored, because the parent disk does not change after the snapshot is taken (it is read-only).

When a snapshot is deleted (because there is no need to revert to the situation at the time the snapshot was made), ESX(i) merges the data written to the delta file back into the parent disk. A snapshot delete is also called a ‘commit’ or ‘consolidation’.

While this is in progress, another delta disk is created to store new writes during the commit. This is the ‘Consolidate Helper snapshot’. It is created at the moment a snapshot file starts being committed to the parent disk. New incoming writes are stored in the consolidate helper snapshot file. Those are committed as well once the initial snapshot file has been successfully committed.

To keep track of snapshot files, ESX(i) uses a .vmsd file, which stores information and metadata about snapshots.

If an administrator wants to restore a certain state of a virtual machine (go back in time), this is called a revert.

This is a great article explaining what is happening under the hood of snapshots. This VMware KB article is also very informative.

Microsoft SQL and Exchange support for snapshots

Mind that not all applications support snapshots. Microsoft's policy on snapshots depends on the product. SQL Server supports snapshots that use VSS. This is the support policy for SQL Server:

SQL Server supports virtualization-aware backup solutions that use VSS (volume snapshots). For example, SQL Server supports Hyper-V backup. Virtual machine snapshots that do not use VSS volume snapshots are not supported by SQL Server. Any snapshot technology that does a behind-the-scenes save of a VM’s point-in-time memory, disk, and device state without interacting with applications on the guest using VSS may leave SQL Server in an inconsistent state.

Exchange Server (2010 and 2013) does not support snapshots. The quote below was taken from a Microsoft article on Exchange 2013.

Some hypervisors include features for taking snapshots of virtual machines. Virtual machine snapshots capture the state of a virtual machine while it’s running. This feature enables you to take multiple snapshots of a virtual machine and then revert the virtual machine to any of the previous states by applying a snapshot to the virtual machine. However, virtual machine snapshots aren’t application aware, and using them can have unintended and unexpected consequences for a server application that maintains state data, such as Exchange. As a result, making virtual machine snapshots of an Exchange guest virtual machine isn’t supported.

Take particular care when making snapshots of Exchange mailbox servers. Snapshots of the HUB and CAS roles should be okay in most cases.

If you make a snapshot of an Exchange Server and later revert to it, there is a chance you will see Exchange errors after the revert. If you are unlucky, Exchange might not be able to mount mailbox stores because of corruption.

When the snapshot is committed, there is a chance the virtual machine has to be paused for a while. When that virtual machine is part of an Exchange DAG, a DAG cluster failover might occur because the heartbeat is temporarily lost.

If you want to make a snapshot of an Exchange server, make sure the virtual machine is shut down first.

Snapshots of Microsoft Active Directory running on Windows Server 2012 are supported on certain versions of the hypervisor. See my post for more info. See this Microsoft post for additional info.

Out of sync situation

Snapshots are used by many backup solutions. However, not all backup solutions clean up the delta disks after the backup of a VM has finished. Some tools just delete the metadata while the delta disks are still being written to.

VMware introduced ‘snapshot consolidation’ in vSphere 5.0. This corrects out-of-sync situations like a leftover snapshot file. Snapshot consolidation commits a chain of snapshot files to the original virtual machine parent file when Snapshot Manager shows that no snapshots exist but the delta files still remain on the datastore.

Snapshot consolidation is a very important task for administrators. Because the leftover snapshot files are still active, they continue to expand and consume disk space until the datastore runs out of space.

How do you know a virtual machine disk needs consolidation? It will be shown in the Summary tab of the vSphere Client.

A small explanation of how to use consolidation is shown in this VMware video.

Slow or paused virtual machines due to commit
Consolidation and snapshot commits could lead to a situation in which the virtual machine is paused for a few seconds or up to over 30 minutes!

This pausing is called a stun and is, in certain circumstances, required to be able to commit delta files.
Stunning is likely to happen when the guest operating system is performing more writes to the delta file than ESX(i) can commit to the parent disk. It is like a car with a top speed of 50 mph trying to overtake a car averaging 60 mph. For the overtake to succeed, the faster car has to stop or slow down for a while.

An ESX(i) stun is a pause of the virtual machine so that snapshot files can be committed to the parent disk. More info on stun in this VMware KB article.

VMware made several enhancements to snapshot commits in various releases of vSphere, but snapshots can still have a severe impact on virtual machines.

ESX(i) will try to commit snapshots without having to stun (pause) the virtual machine. Performing snapshot commits while the virtual machine is running is called asynchronous consolidate.

Initially the commit is performed during a period of 5 minutes. If this commit fails to get rid of all snapshot files because too many writes are coming in, ESX(i) tries again with a duration of 10 minutes. If this fails as well because too many new writes are coming in, the snapshot commit duration is extended to 20 minutes. In total ESX(i) tries a maximum of 10 times (called iterations).

Thereafter the virtual machine will be stunned. This is called a synchronous consolidate. Stunning means no new writes are coming in, so ESX(i) is able to commit all snapshot files.

Beginning in ESXi 5.0, the snapshot stun times are logged. Each virtual machine’s log file (vmware.log) will contain messages similar to:

2013-03-23T17:40:02.544Z| vcpu-0| Checkpoint_Unstun: vm stopped for 403475568 us

In this example, the virtual machine was stunned for 403475568 microseconds (1 second = 1 million microseconds), which is about 403 seconds or nearly 7 minutes.
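To check a whole vmware.log for stuns, the durations can be pulled out with a regular expression. This is a small sketch that assumes every stun message has exactly the format of the line above; the function name is made up for this illustration.

```python
import re

# Matches the stun message shown above and captures the duration in microseconds.
STUN_RE = re.compile(r"Checkpoint_Unstun: vm stopped for (\d+) us")


def stun_seconds(log_text):
    """Return every logged stun duration in seconds, in order of appearance."""
    return [int(us) / 1_000_000 for us in STUN_RE.findall(log_text)]


line = "2013-03-23T17:40:02.544Z| vcpu-0| Checkpoint_Unstun: vm stopped for 403475568 us"
print(stun_seconds(line))  # [403.475568]
```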

Avoiding stun or keeping the stun duration as short as possible
If you do not want to stun / pause the virtual machine, you can set snapshot.maxIterations to 20 (or higher). This means vSphere will make more attempts (iterations) to commit the snapshot files. More information in this KB article.

Be careful when changing settings and closely monitor the effects.

To do this:

  1. Shut down the virtual machine
  2. Right-click the virtual machine and click Edit Settings.
  3. Click the Options tab.
  4. Under Advanced, click General.
  5. Click Configuration Parameters and add snapshot.maxIterations

However, this could make things worse. Think again about that car (the commit process) trying to chase the other, leading car (the writes from the OS and applications in the guest). If the speed of the leading car remains higher than that of the chasing car, then the longer the chase lasts, the bigger the distance becomes.

Alternatively, you can set snapshot.maxConsolidateTime to 60 seconds. This means you accept a pause of the virtual machine of up to 60 seconds for a synchronous consolidate. This is often a better option than waiting for the snapshot file to grow so big that the virtual machine needs to be stunned for a much longer time.
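The chase can be put into numbers with a toy model. This is not VMware's actual algorithm. It ignores the 5/10/20 minute windows and assumes that each pass commits the whole current delta at a constant rate while the guest keeps writing at a constant rate. The function and parameter names are made up for this illustration; max_iterations and max_stun_s only loosely mirror snapshot.maxIterations and snapshot.maxConsolidateTime.

```python
def simulate_consolidation(delta_mb, write_mb_s, commit_mb_s,
                           max_iterations=10, max_stun_s=60.0):
    """Toy model of asynchronous consolidate.

    Returns (passes used, final stun time in seconds).
    """
    remaining = float(delta_mb)
    for iteration in range(max_iterations):
        # Stun as soon as the remaining delta fits in the accepted pause.
        if remaining / commit_mb_s <= max_stun_s:
            return iteration, remaining / commit_mb_s
        # Commit the current delta; new writes pile up in the helper meanwhile.
        pass_seconds = remaining / commit_mb_s
        remaining = write_mb_s * pass_seconds
    # Out of iterations: the VM is stunned for whatever is left.
    return max_iterations, remaining / commit_mb_s


# Commit is faster than the guest writes: the delta shrinks with every pass.
print(simulate_consolidation(60000, write_mb_s=50, commit_mb_s=100))  # (4, 37.5)

# Guest writes faster than the commit: more iterations only make the stun longer.
print(simulate_consolidation(60000, 120, 100, max_iterations=10)[1])
print(simulate_consolidation(60000, 120, 100, max_iterations=20)[1])
```

In the second scenario the delta grows by a factor of 1.2 per pass, so in this model raising the number of iterations only postpones a longer stun.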

An update to ESXi 4.1 added the parameter snapshot.asyncConsolidate.forceSync = "FALSE", which needs to be added to the VMX file. This setting disables synchronous consolidate, so the virtual machine will never be stunned. More info in this KB.
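All three settings end up as key = "value" lines in the .vmx file. The sketch below shows how such a line could be added or updated. It works on the file's text only and assumes one key = "value" entry per line. The function name and the VM name exch01 are made up for this illustration. As in the steps above, the virtual machine should be shut down before its configuration is changed.

```python
import re


def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key = "value" set, replacing an existing entry."""
    new_line = f'{key} = "{value}"'
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    if pattern.search(vmx_text):
        # A function as replacement avoids backslash escapes being interpreted.
        return pattern.sub(lambda m: new_line, vmx_text)
    if vmx_text and not vmx_text.endswith("\n"):
        vmx_text += "\n"
    return vmx_text + new_line + "\n"


vmx = 'displayName = "exch01"\n'  # hypothetical, heavily shortened .vmx content
vmx = set_vmx_option(vmx, "snapshot.maxConsolidateTime", "60")
vmx = set_vmx_option(vmx, "snapshot.maxIterations", "20")
print(vmx)
```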

 

Some additional info
VMware has published a remarkable number of knowledge base articles on snapshots. Below are just some examples.

VMware KB A snapshot removal can stop a virtual machine for long time (1002836)
VMware KB Virtual machines residing on NFS storage become unresponsive during a snapshot removal operation (2010953)
VMware KB Delete all Snapshots and Consolidate Snapshots feature FAQ (1023657)
VMware KB Commands to monitor snapshot deletion in ESX 2.5/3.x/4.x and ESXi 3.x/4.x/5.x (1007566)
VMware KB Consolidating snapshots in vSphere 5.x (2003638)
VMware KB Configuring VMware vCenter Server to send alarms when virtual machines are running from snapshots (1018029)

About Marcel van den Berg
I am a technical consultant with a strong focus on server virtualization, desktop virtualization, cloud computing and business continuity/disaster recovery.
