Checking hardware recommendations might prevent a VSAN nightmare.

<update June 4>

Jason Gill posted the Root Cause Analysis that VMware performed on the issue described below. Indeed, the issue was caused by the use of the Dell PERC H310 controller, which has a very low queue depth. A quote:

While this controller was certified and is in our Hardware Compatibility List, its use means that your VSAN cluster was unable to cope with both a rebuild activity and running production workloads. While VSAN will throttle back rebuild activity if needed, it will insist on minimum progress, as the user is exposed to the possibility of another error while unprotected. This minimum rebuild rate saturated the majority of resources in your IO controller. Once the IO controller was saturated, VSAN first throttled the rebuild, and — when that was not successful — began to throttle production workloads.

Read the full Root Cause Analysis here on Reddit.

Another interesting observation from the Reddit thread is that the Dell PERC H310 is actually an OEM version of the LSI 2008 card. John Nicholson wrote a very interesting blog about the H310 here.

Dell seems to ship the H310 with old firmware. With the latest firmware, the queue depth of the Dell PERC H310 can be increased to 600!

We went from 270 write IOPS at 30 ms of write latency to 3000 write iops at .2ms write latency just by upgrading to the new firmware that took queue depth from 25 to 600

This article explains how to flash a Dell PERC H310 with newer firmware. I am not sure if a flashed PERC H310 is supported by VMware. As an HBA with better specs is not that expensive, I advise flashing the Dell PERC H310 only in non-production environments.
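
Before and after such a firmware change it is worth verifying which queue depth the ESXi host actually sees. A minimal sketch using standard ESXi tools (adapter names and output values vary per host; treat them as examples):

    # Adapter queue depth: in esxtop, press 'd' for the disk adapter view
    # and check the AQLEN column of the vmhba backing the disk group.
    esxtop

    # Per-device maximum queue depth as reported by ESXi:
    esxcli storage core device list | grep -i "Queue Depth"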

————————————————————-

June 02, 2014

An interesting post appeared on Reddit. The post, titled My VSAN nightmare, describes a serious issue in a VSAN cluster. When one of the three storage nodes failed with a purple screen, initially all seemed fine. VMware HA kicked in and restarted VMs on the surviving nodes (two compute and two storage nodes). The customer was worried about redundancy, as storage was now located on just two nodes, so SSD and HDD storage was added to one of the compute nodes. This node did not have local storage before.

However, exactly 60 minutes after adding the new storage, DRS started to move VMs to other hosts, a massive amount of I/O was seen, and all (about 77) VMs became unresponsive and died. VSAN Observer showed that I/O latency had jumped to 15-30 seconds (up from just a few milliseconds on a normal day).

VMware support could not solve the situation and basically said to the customer: “wait till this I/O storm is over”. About 7 hours later the critical VMs were running again. No data was lost.

At the moment VMware support is analyzing what went wrong in order to produce a Root Cause Analysis.

Issues on VSAN like the one documented on Reddit are very rare. This post provides a look under the covers of VSAN. I hope it helps you understand what is going on under the hood and prevents this situation from happening to you as well.

Let’s have a closer look at the VSAN hardware configuration of the customer who wrote about his experiences on Reddit.

VSAN hardware configuration
The customer was using 5 nodes in a VSAN cluster: 2x compute nodes (no local storage) and 3x storage nodes, each with 6 magnetic disks and 2 SSDs split into two disk groups per node.
Two 10 Gb NICs were used for VSAN traffic. A Dell PERC H310 controller was used, which has a queue depth of only 25. The HDDs were Western Digital WD2000FYYZ drives: 2 TB, 7200 rpm SATA. The SSDs were Intel DC S3700 200 GB.
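
On each host, the disk group layout and the role of each device can be verified from the command line. A small sketch (run from the ESXi shell of a VSAN host; output abbreviated):

    esxcli vsan cluster get      # shows whether the host is part of a VSAN cluster and its role
    esxcli vsan storage list     # lists the claimed SSDs/HDDs and the disk group each device belongs to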

The Dell PERC H310 is interesting, as Duncan Epping states in his post here:

Generally speaking it is recommended to use a disk controller with a queue depth > 256 when used for VSAN or “host local caching” solutions

VMware VSAN Hardware Guidance also states:

The most important performance factor regarding storage controllers in a Virtual SAN solution is the supported queue depth. VMware recommends storage controllers with a queue depth of greater than 256 for optimal Virtual SAN performance. For optimal performance of storage controllers in RAID 0 mode, disable the write cache, disable read-ahead, and enable direct I/Os.
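
On LSI-based controllers such as the PERC H310 these logical drive properties are normally set with the vendor’s controller management tool. Below is a hedged sketch assuming an LSI controller managed with MegaCli; the exact tool and syntax differ per vendor and card, so verify against the Dell/LSI documentation before using this anywhere near production:

    # Sketch only, assuming a MegaCli-managed LSI controller (verify for your card)
    MegaCli -LDSetProp WT -LAll -aAll        # write-through: disable the write cache
    MegaCli -LDSetProp NORA -LAll -aAll      # disable read-ahead
    MegaCli -LDSetProp Direct -LAll -aAll    # enable direct I/O (bypass controller cache)
    MegaCli -LDGetProp -Cache -LAll -aAll    # show the resulting cache policy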

Dell states about the Dell PERC H310:

 Our entry-level controller card provides moderate performance.

Before we dive into the possible cause of this issue, let’s first cover some VMware VSAN basics. Both Duncan Epping and Cormac Hogan of VMware wrote some great postings about VSAN. Recommended reads! See the links at the end of this post.

VSAN servers 
There are two ways to install a new VSAN server:

  1. assemble one yourself using components listed in the VSAN Hardware Compatibility Guide
  2. use one of the VSAN Ready Nodes which can be purchased. 16 models are available now from various vendors like Dell and Supermicro.

Dell has 8 different servers listed as VSAN Ready Nodes. One of them is the PowerEdge R720-XD, which is the same server type used by the customer describing his VSAN nightmare. However, the Dell VSAN Ready Node has 1 TB NL-SAS HDDs while the Reddit case used 2 TB SATA drives. So he was most likely using self-assembled servers.

Interestingly, 4 out of the 8 Dell VSAN Ready Node servers use the Dell PERC H310 controller. Again, VMware advises a controller with a queue depth of over 256, while the PERC H310 has 25.

(Image: Dell VSAN Ready Node configurations)

VSAN storage policies
For each virtual machine or virtual disk active in a VSAN cluster, an administrator can set ‘virtual machine storage policies’. One of the available policies is named ‘number of failures to tolerate’. When set to 1, virtual machines to which this policy is applied will survive the failure of a single disk controller, host or NIC.

VSAN provides this redundancy by creating one or more replicas of the VMDK files and storing them on different storage nodes in the VSAN cluster.

In case a replica is lost, VSAN will initiate a rebuild. A rebuild recreates the replicas of the affected VMDKs.
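
To see how replicas are actually placed, you can inspect a VM’s objects with RVC. A small sketch (the VM path below is an example; adjust it to your own inventory):

    # From RVC, connected to vCenter:
    > vsan.vm_object_info ~/vms/myvm
    # Shows, per object, the applied policy (for example hostFailuresToTolerate = 1)
    # and the RAID-1 component layout: which hosts and disks hold each replica and witness.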

VSAN response to a failure

VSAN’s response to a failure depends on the type of failure.
A failure of an SSD, HDD or disk controller results in an immediate rebuild. VSAN understands this is a permanent failure which is not caused by, for example, planned maintenance.

A failure of the network or a host results in a rebuild which is initiated after a delay of 60 minutes. This is the default wait. The delay exists because the absence of a host or network could be temporary (maintenance, for example), and it prevents wasting resources. Duncan Epping explains the details in his post How VSAN handles a disk or host failure.
The image below was taken from this blog.

If the failed component returns within 60 minutes, only a data sync will take place: only the data changed during the absence is copied over to the replica(s).

A rebuild, however, means that a new replica is created for every VMDK object that is not compliant. This is also referred to as a ‘full data migration’.

To change the delay time, see the VMware KB article Changing the default repair delay time for a host failure in VMware Virtual SAN (2075456).
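
Per that KB article, the delay is an advanced setting that has to be changed on every ESXi host in the cluster. A short sketch of the commands it describes (the value is in minutes; 60 is the default and 90 is just an example, so verify the exact steps in the KB):

    esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90
    /etc/init.d/clomd restart    # restart clomd so the new value takes effect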

Control and monitor VSAN rebuild progress
At the moment VMware does not provide a way to control the rebuild process, and monitoring its progress is limited. In the case described on Reddit, VMware basically advised ‘wait and it will be alright’. There was no way to predict how long the performance of all VMs stored on VSAN would be badly affected by the rebuild. The only way to see the status of a VM is to click on the VM in the vSphere web client, select its storage policies tab, then click on each of its virtual disks and check the list; it will tell you “Active”, “Reconfiguring” or “Absent”.

For monitoring, VSAN Observer provides insight into what is happening.

Looking at clomd.log can also give an indication of what is going on. This is the log file of the Cluster Level Object Manager (CLOM).

It is also possible to use command-line tools for administration, monitoring and troubleshooting. VSAN uses the Ruby vSphere Console (RVC) command line. Florian Grehl wrote a few blogs about managing VSAN using RVC.

The VMware VSAN Quick Troubleshooting and Monitoring Reference Guide has many details as well.
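
A few concrete ways to watch a resync are sketched below; the cluster path is an example and should be adjusted to your own inventory:

    # From RVC, connected to vCenter:
    > vsan.resync_dashboard ~/computers/VSAN-Cluster    # bytes left to resync per object
    > vsan.disks_stats ~/computers/VSAN-Cluster         # per-disk usage and component counts
    > vsan.obj_status_report ~/computers/VSAN-Cluster   # healthy versus absent/degraded objects
    > vsan.observer ~/computers/VSAN-Cluster --run-webserver --force   # live VSAN Observer (port 8010)

    # On an ESXi host, follow the CLOM log:
    tail -f /var/log/clomd.log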

Possible cause
It looks like the VSAN rebuild process, which started exactly 60 minutes after the extra storage was added, initiated the I/O storm. VSAN was correcting a non-compliant storage profile and started to recreate replicas of VMDK objects.

A possible cause for this I/O storm could be that the rebuild of almost all VMDK files in the cluster was executed in parallel. However, according to Dinesh Nambisan of the VMware VSAN product team:

 “VSAN does have an inbuilt throttling mechanism for rebuild traffic.”

VSAN seems to use a Quality of Service mechanism for throttling back replication traffic. How this exactly works and whether it is controllable by customers is unclear. I am sure we will soon learn more about this, as it seems key to solving future issues with low-end controllers and HDDs combined with a limited number of storage nodes.

While the root cause has yet to be determined, a combination of configuration choices could have caused this:

1. Only three servers in the VSAN cluster provided storage. When one failed, only two were left, and both were busy rebuilding about 77 virtual machines at the same time.
2. SATA 7200 rpm drives were used as the persistent HDD storage layer. That is fine for normal operations when SSD is used for caching, but in a rebuild these are not the most powerful drives, and they have low queue depths.
3. An entry-level Dell PERC H310 disk controller was used. The queue depth of this controller is only 25, while the advice is to use a controller with a queue depth of 256 or more. A quick way to check whether the controller and disks are the bottleneck is sketched below.
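
A hedged sketch of such a check, using standard ESXi tools (device names and values will differ per host):

    # Per-device maximum queue depth: SATA/NCQ drives typically report 32,
    # SAS and NL-SAS drives typically report around 254.
    esxcli storage core device list | grep -E "Display Name|Device Max Queue Depth"

    # During a rebuild, esxtop (press 'u' for the device view) shows whether
    # the devices are saturated: watch the DQLEN, QUED and DAVG/cmd columns.
    esxtop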

Some considerations
1. Just to be on the safe side, use controllers with a queue depth of at least 256.
2. For production workloads use N+2 redundancy.
3. Use NL-SAS drives or better HDDs. These have much higher queue depths (256) compared to SATA HDDs (32).
4. In case of a failure of a VSAN storage node, try to fix the server by swapping memory or other components to prevent a rebuild. A sync is always better than a rebuild.

5. It would be helpful if VMware added more control over the rebuild process. When N+2 is used, rebuilds could be scheduled to run only during non-business hours. Some sort of priority control over which replicas are rebuilt first would also be nice. Something like this:

In case of N+1: tier 1 VMs rebuild after 60 minutes; tier 2 and 3 VMs rebuild during non-business hours.
In case of N+2: all rebuilds run only during non-business hours; tier 1 VMs first, then tier 2, then tier 3, and so on.

Some other blogs about this particular case
Jeramiah Dooley Hardware is Boring–The HCL Corollary

Hans De Leenheer VSAN: THE PERFORMANCE IMPACT OF EXTRA NODES VERSUS FAILURE

Some usefull links providing insights into VSAN

Jason Langer: Notes from the Field: VSAN Design–Networking

Duncan Epping and others wrote many postings about VSAN. Here is a complete overview.

A selection of those blog posts which are interesting for this case:
Duncan Epping How long will VSAN rebuilding take with large drives?
Duncan Epping 4 is the minimum number of hosts for VSAN if you ask me
Duncan Epping How VSAN handles a disk or host failure
Duncan Epping Disk Controller features and Queue Depth?

Cormac Hogan VSAN Part 25 – How many hosts needed to tolerate failures?

Cormac Hogan Components and objects and What is a witness disk



8 Responses to Checking hardware recommendations might prevent a VSAN nightmare.

  1. Erik Stolcher says:

    1) How many clustered servers does one need to run 77 VMs?
    2) With the low I/O this user has in his environment, why would he need SAS drives?
    3) Dell’s controller is on the HCL. If VMware believes this controller is inadequate then it shouldn’t be on it.

    Bottom line: a server died, the user added capacity on a new node, and the rebalancing created havoc for several hours, disrupting the business.

    Furthermore, SATA usage is widespread among storage arrays, but the rebuild results are more graceful than this. So pointing the finger at the user, saying he should have checked the HCL, when it’s clear this has nothing to do with the HCL, is certainly a slap in the face to him.

    Have a nice day
    Erik

  2. I agree that adding a controller to the HCL while it has far less than the recommended queue depth is confusing. However, the HCL indicates that a component has been tested and is supported by VMware. It is not a guarantee of performance.

    It is a bit like a driver’s license. If you have one, you are allowed to drive a car (you are on the HCL). However, you cannot blame the car manufacturer for crashing the car into a tree because the driver was drunk (not a recommended practice).

    It will be interesting to learn why the rebuild caused VMs to halt, as VSAN is supposed to throttle the rebuild.

  3. Erik Stolcher says:

    The driver didn’t crash his server intentionally. He just added capacity to an existing server. It would be no different from adding a shelf of disks to a storage array. If the architecture can’t handle this gracefully, then it’s not the “driver’s fault”.

    To me the driver doesn’t appear to have been “drunk”, but he does appear to have succumbed to the kool-aid effect and the hype of “here, press this button and voila, you now have a server SAN and you need not do anything else. It’s that simple”, when in fact it’s abundantly clear it’s more complicated than that.

    • Hey if some architect/design time wasn’t a good idea then I wouldn’t have a day job so I’m not going to completely disagree with you here 🙂

      BTW, thanks for the link back, and you did a solid job fleshing out in greater detail my thoughts on this. VMware made it abundantly clear that ready nodes were bundles vendors had selected with components from the HCL; they were not necessarily specially extra-benchmarked configs.

      VSAN actually ran alright on those 25 queue depth HBAs (I had some in the lab and ran it for a while). It’s only when you needed to burst a lot of writes that things got weird. Rebuilds were not even that bad when you’re operating in a low I/O environment (like my lab). It’s only when you have load that it hurts (77 VMs isn’t a tiny workload for what ended up being 2 nodes of storage after he lost the +1). I’ve heard of people having performance issues with Nutanix and flash-tiered arrays for similar undersizing/design issues, so this isn’t magically a VSAN issue. Those of us who’ve been battling target queue depth on FC arrays or path limitations on EqualLogics are no stranger to all the crazy ways a perfectly decent design can be crippled.

  4. The ‘drunk driving’ remark was a metaphor for using a disk controller with far less queue depth than recommended. The metaphor was not about adding additional storage out of worries about redundancy. I can perfectly understand the intention of that.

    The purpose of my post was exactly as you state: VSAN looks simple, but you have to be aware of your design, choice of hardware, risk versus budget, and procedures.

    This case is an example of a typical Swiss Cheese model
    http://en.wikipedia.org/wiki/Swiss_cheese_model

    Several situations that came together at the same time have probably caused this.
    Nobody is to blame; let’s all learn from this.

  5. N+2 won’t really help that much other than adding more nodes to the cluster, hence spreading the load of redistribution over one more node. The only thing N+2 gives you is the ability to evacuate a node for maintenance without the risk of running into data loss in case of a failure during that maintenance.

    The more I think about this case, the more I think VMware is not to blame here and neither is the hardware vendor (in this case Dell). Think about Windows 8.1, which has a minimum requirement of 1 GHz/1 GB. Would you be surprised if it ran like sh*t when you actually used that? Compare that to using the lowest-end HBAs on a distributed storage stack with a 10GbE backbone and SSDs, redistributing 77 VMs at the same time.

    That being said, a better throttling/QoS mechanism would probably make sense, where the result would be a very long rebuild time rather than bringing the whole system to its knees.

  6. Took the liberty of combining my thoughts on N+2 and performance impact in a blog post: http://hansdeleenheer.com/vsan-the-performance-impact-of-extra-nodes-versus-failure/

  7. Chris says:

    Enjoyed the read, thanks.

    PS: search for “save site” and replace with “safe side”
