19 controllers removed from VMware Virtual SAN (VSAN) compatibility list

VMware announced in this KB article that 19 disk controllers initially listed on the VSAN compatibility list have been removed from the list. This makes the controllers unsupported for use in VSAN configurations.

The reason for this removal, according to VMware’s KB article, is:

As part of VMware’s ongoing testing and certification efforts on Virtual SAN compatible hardware, VMware has decided to remove these controllers from the Virtual SAN compatibility list. While fully functional, these controllers offer too low IO throughput to sustain the performance requirements of most VMware environments. Because of the low queue depth offered by these controllers, even a moderate IO rate could result in IO operations timing out, especially during disk rebuild operations. In this event, the controller may be unable to cope with both a rebuild activity and running Virtual Machine IO causing elongated rebuild time and slow application responsiveness. To avoid issues, such as the one described above, VMware is removing these controllers from the Hardware Compatibility List.

 

One of the controllers removed is the Dell PERC H310. Likely one of the reasons this controller has been removed is a serious issue described at Reddit.com. A VSAN user experienced basically a meltdown of his virtual infrastructure when a VSAN node failed and, after a while, a rebuild started. This caused more IO than the PERC H310 could handle and all VMs came to a standstill.

These are all the controllers removed from the VSAN compatibility list:

[Image: vsan-controllers]

Checking hardware recommendations might prevent a VSAN nightmare.

<update June 4>

Jason Gill posted the Root Cause Analysis done by VMware on his issue described below. Indeed, the issue was caused by the use of the Dell PERC H310 controller, which has a very low queue depth. A quote:

While this controller was certified and is in our Hardware Compatibility List, its use means that your VSAN cluster was unable to cope with both a rebuild activity and running production workloads. While VSAN will throttle back rebuild activity if needed, it will insist on minimum progress, as the user is exposed to the possibility of another error while unprotected. This minimum rebuild rate saturated the majority of resources in your IO controller. Once the IO controller was saturated, VSAN first throttled the rebuild, and — when that was not successful — began to throttle production workloads.

Read the full Root Cause Analysis here at Reddit

Another interesting observation while reading the thread on Reddit is that the Dell PERC H310 actually is an OEM version of the LSI 2008 card. John Nicholson wrote a very interesting blog about the H310 here.

Dell seems to ship the H310 with old firmware. When using the latest firmware, the queue depth of the Dell PERC H310 can be increased to 600!

We went from 270 write IOPS at 30 ms of write latency to 3000 write iops at .2ms write latency just by upgrading to the new firmware that took queue depth from 25 to 600
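To get a feel for why that jump in queue depth matters, Little's Law (concurrency = throughput × latency) gives a rough ceiling on controller throughput. The sketch below is a back-of-the-envelope illustration only; the latency figure is an assumption, not a measurement from this case.

```python
# Back-of-the-envelope: how queue depth caps throughput (Little's Law).
# concurrency = throughput * latency  =>  max IOPS ~= queue_depth / per-IO latency.
# The latency value is an illustrative assumption, not a measured figure.

def max_iops(queue_depth: int, io_latency_s: float) -> float:
    return queue_depth / io_latency_s

# With an assumed 30 ms per-IO service time under rebuild load:
print(max_iops(queue_depth=25, io_latency_s=0.030))    # ~833 IOPS ceiling
print(max_iops(queue_depth=600, io_latency_s=0.030))   # ~20,000 IOPS ceiling
```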

This article explains how to flash a Dell PERC H310 with newer firmware. I am not sure whether a flashed PERC H310 is supported by VMware. As an HBA with better specs is not that expensive, I advise flashing the Dell PERC H310 only when it is used in non-production environments.

————————————————————-

June 02, 2014

An interesting post appeared on Reddit. The post, titled My VSAN nightmare, describes a serious issue in a VSAN cluster. When one of the three storage nodes failed displaying a purple screen, initially all seemed fine. VMware HA kicked in and restarted VMs on the surviving nodes (two compute and two storage nodes). The customer was worried about redundancy as storage was now located on just two nodes. So SSD and HDD storage was added to one of the compute nodes. This node did not have local storage before.

However, exactly 60 minutes after adding the new storage, DRS started to move VMs to other hosts, a lot of IO was seen, and all (about 77) VMs became unresponsive and died. VSAN Observer showed that IO latency had jumped to 15-30 seconds (up from just a few milliseconds on a normal day).

VMware support could not solve the situation and basically told the customer: “wait till this I/O storm is over”. About 7 hours later the critical VMs were running again. No data was lost.

At the moment VMware support is analyzing what went wrong in order to produce a Root Cause Analysis.

Issues on VSAN like the one documented on Reddit are very rare. This post provides a look under the covers of VSAN. I hope it helps you understand what is going on under the hood of VSAN and might prevent this situation from happening to you as well.

Let’s have a closer look at the VSAN hardware configuration of the customer who wrote about his experiences on Reddit.

VSAN hardware configuration
The customer was using 5 nodes in a VSAN cluster: 2x compute nodes (no local storage) and 3x storage nodes, each with 6x magnetic disks and 2x SSDs, split into two disk groups each.
Two 10 Gb NICs were used for VSAN traffic. A Dell PERC H310 controller was used, which has a queue depth of only 25. Western Digital WD2000FYYZ HDDs were used: 2 TB, 7200 rpm SATA drives. The SSDs were Intel DC S3700 200 GB.
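For a sense of scale, the raw and usable capacity of this layout can be worked out with simple arithmetic (a rough sketch based on the figures above; it ignores SSD cache, metadata and slack space, and assumes a 'failures to tolerate' setting of 1):

```python
# Rough capacity arithmetic for the cluster described above (illustrative only).
storage_nodes = 3
hdds_per_node = 6
hdd_tb = 2.0                        # WD2000FYYZ, 2 TB each

raw_tb = storage_nodes * hdds_per_node * hdd_tb   # 36 TB raw HDD capacity
usable_ftt1_tb = raw_tb / 2                       # ~18 TB usable with FTT=1 (two copies of all data)
print(raw_tb, usable_ftt1_tb)
```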

The Dell PERC H310 is interesting, as Duncan Epping’s post here states:

Generally speaking it is recommended to use a disk controller with a queue depth > 256 when used for VSAN or “host local caching” solutions

VMware VSAN Hardware Guidance also states:

The most important performance factor regarding storage controllers in a Virtual SAN solution is the supported queue depth. VMware recommends storage controllers with a queue depth of greater than 256 for optimal Virtual SAN performance. For optimal performance of storage controllers in RAID 0 mode, disable the write cache, disable read-ahead, and enable direct I/Os.

Dell states about the Dell PERC H310:

 Our entry-level controller card provides moderate performance.

Before we dive into the possible cause of this issue, let’s first provide some basics on VMware VSAN. Both Duncan Epping and Cormac Hogan of VMware wrote some great postings about VSAN. Recommended reads! See the links at the end of this post.

VSAN servers 
There are two ways to get a new VSAN server:

  1. assemble one yourself using components listed in the VSAN Hardware Compatibility Guide
  2. buy one of the VSAN Ready Nodes. 16 models are currently available from various vendors like Dell and Supermicro.

Dell has 8 different servers listed as VSAN Ready Nodes. One of them is the PowerEdge R720-XD, which is the same server type used by the customer describing his VSAN nightmare. However, the Dell VSAN Ready Node has 1 TB NL-SAS HDDs while the Reddit case used 2 TB SATA drives. So he was likely using servers he assembled himself.

Interestingly, 4 out of the 8 Dell VSAN Ready Node servers use the Dell PERC H310 controller. Again, VMware advises a controller with a queue depth of over 256, while the PERC H310 has 25.

[Image: Dell-vsan-ready-node]

VSAN storage policies
For each virtual machine or virtual disk active in a VSAN cluster, an administrator can set ‘virtual machine storage policies’. One of the available storage policies is named ‘number of failures to tolerate’. When set to 1, virtual machines to which this policy is applied will survive a failure of a single disk controller, host or NIC.

VSAN provides this redundancy by creating one or more replicas of VMDK files and storing these on different storage nodes in a VSAN cluster.

When a replica is lost, VSAN will initiate a rebuild. A rebuild recreates the replicas of the affected VMDKs.
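To make the policy concrete: n ‘failures to tolerate’ means n+1 full replicas (plus witness components) and, per the generally documented VSAN rule, at least 2n+1 hosts. A minimal sketch of that arithmetic:

```python
# How the 'number of failures to tolerate' (FTT) policy translates into replicas,
# hosts and raw capacity. Standard VSAN rules: n+1 replicas, 2n+1 hosts minimum.

def vsan_footprint(ftt: int, vmdk_gb: float):
    replicas = ftt + 1              # full data copies of the VMDK
    min_hosts = 2 * ftt + 1         # hosts needed to place replicas plus witnesses
    raw_gb = replicas * vmdk_gb     # raw capacity consumed by the copies
    return replicas, min_hosts, raw_gb

print(vsan_footprint(ftt=1, vmdk_gb=100))   # (2, 3, 200)
print(vsan_footprint(ftt=2, vmdk_gb=100))   # (3, 5, 300)
```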

VSAN response to a failure

VSAN’s response to a failure depends on the type of failure.
A failure of an SSD, HDD or disk controller results in an immediate rebuild. VSAN understands this to be a permanent failure which is not caused by, for example, planned maintenance.

A failure of the network or a host results in a rebuild which is initiated after a delay of 60 minutes. This is the default wait. The wait exists because the absence of a host or network could be temporary (maintenance, for example) and it prevents wasting resources. Duncan Epping explains the details in his post How VSAN handles a disk or host failure.

If the failed component returns within 60 minutes, only a data sync will take place. In that case only the data changed during the absence will be copied over to the replica(s).

A rebuild, however, means that a new replica will be created for all VMDK files that are not compliant. This is also referred to as a ‘full data migration’.
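The amount of data a full data migration has to move is roughly the used capacity of the non-compliant replicas, so a very rough duration estimate is that data divided by the rebuild throughput the cluster can sustain. The sketch below is deliberately simplistic; the numbers are hypothetical assumptions and real rebuild times also depend on VSAN's throttling and on competing VM I/O.

```python
# Very rough rebuild-time estimate: data to recreate / sustainable rebuild rate.
# Simplistic by design; VSAN throttles rebuild traffic and competes with VM I/O.

def rebuild_hours(data_tb: float, rebuild_mb_per_s: float) -> float:
    seconds = (data_tb * 1024 * 1024) / rebuild_mb_per_s
    return seconds / 3600

# Hypothetical example: 6 TB of affected replicas at 200 MB/s sustained.
print(round(rebuild_hours(6, 200), 1))   # ~8.7 hours
```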

To change the delay time, see the VMware KB article Changing the default repair delay time for a host failure in VMware Virtual SAN (2075456).

Control and monitor VSAN rebuild progress
At the moment VMware does not provide a way to control and monitor the progress of the rebuild process. In the case described at Reddit, VMware basically advised ‘wait and it will be alright’. There was no way to predict how long the performance of all VMs stored on VSAN would be badly affected by the rebuild. The only way to see the status of a VM is to click on the VM in the vSphere Web Client, select its storage policies tab, then click on each of its virtual disks and check the list; it will tell you “Active”, “Reconfiguring”, or “Absent”.

For monitoring, VSAN Observer provides insight into what is happening.

Looking at clomd.log could also give an indication of what is going on. This is the log file of the Cluster Level Object Manager (CLOM).

It is also possible to use command-line tools for administration, monitoring and troubleshooting. VSAN uses the Ruby vSphere Console (RVC) command line. Florian Grehl wrote a few blogs about managing VSAN using RVC.

The VMware VSAN Quick Troubleshooting and Monitoring Reference Guide has many details as well.

Possible cause
It looks like the VSAN rebuild process, which started exactly 60 minutes after the extra storage had been added, initiated the I/O storm. VSAN was correcting a non-compliant storage profile and started to recreate replicas of VMDK objects.

A possible cause for this I/O storm could be that the rebuild of almost all VMDK files in the cluster was executed in parallel. However, according to Dinesh Nambisan of the VMware VSAN product team:

 “VSAN does have an inbuilt throttling mechanism for rebuild traffic.”

VSAN seems to use a Quality of Service system for throttling back replication traffic. How this exactly works, and whether it is controllable by customers, is unclear. I am sure we will soon learn more about this, as it seems key in solving future issues with low-end controllers and HDDs combined with a limited number of storage nodes.

While the root cause has yet to be determined, a combination of configuration choices could have caused this:

1. Only three servers in the VSAN cluster were used for storage. When one failed, only two were left. Both were busy rebuilding about 77 virtual machines at the same time.
2. Using SATA 7200 rpm drives as the persistent HDD storage layer. Fine for normal operations when SSD is used for cache, but not the most powerful drives during a rebuild operation, and they have low queue depths.
3. Using an entry-level Dell PERC H310 disk controller. The queue depth of this controller is only 25, while the advice is to use a controller with a 256+ queue depth.

Some considerations
1. Just to be on the safe side, use controllers with at least a 256+ queue depth.
2. For production workloads use N+2 redundancy.
3. Use NL-SAS drives or better HDDs. These have much higher queue depths (256) compared to SATA HDDs (32).
4. In case of a failure of a VSAN storage node: try to fix the server by swapping memory/components to prevent rebuilds. A sync is always better than a rebuild.

5. It would be helpful if VMware added more control over the rebuild process. When N+2 is used, rebuilds could be scheduled to run only during non-business hours. Also, some control over the priority in which replicas are rebuilt would be nice. Something like this (see the sketch after the list):

In case of N+1: tier 1 VMs rebuild after 60 minutes; tier 2 and 3 rebuild during non-business hours.
In case of N+2: all rebuilds only during non-business hours; tier 1 VMs first, then tier 2, then tier 3, etc.
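As a toy illustration of the kind of scheduling policy proposed above (purely hypothetical; VSAN does not expose this kind of control today):

```python
# Toy illustration of the tiered rebuild scheduling proposed above.
# Purely hypothetical: VSAN offers no such scheduling control.
from datetime import datetime

def may_rebuild_now(redundancy: str, vm_tier: int, now: datetime) -> bool:
    business_hours = 8 <= now.hour < 18
    if redundancy == "n+1":
        # Tier 1 rebuilds as soon as the 60-minute delay expires; lower tiers wait.
        return vm_tier == 1 or not business_hours
    if redundancy == "n+2":
        # All rebuilds deferred to non-business hours (tier 1 first, then 2, 3).
        return not business_hours
    return True

print(may_rebuild_now("n+1", vm_tier=2, now=datetime(2014, 6, 2, 14, 0)))  # False
print(may_rebuild_now("n+2", vm_tier=1, now=datetime(2014, 6, 2, 22, 0)))  # True
```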

Some other blogs about this particular case
Jeramiah Dooley Hardware is Boring–The HCL Corollary

Hans De Leenheer VSAN: THE PERFORMANCE IMPACT OF EXTRA NODES VERSUS FAILURE

Some useful links providing insights into VSAN

Jason Langer : Notes from the Field: VSAN Design–Networking

Duncan Epping and others wrote many postings about VSAN. Here is a complete overview.

A selection of those blog posts which are interesting for this case:
Duncan Epping How long will VSAN rebuilding take with large drives?
Duncan Epping 4 is the minimum number of hosts for VSAN if you ask me
Duncan Epping How VSAN handles a disk or host failure
Duncan Epping Disk Controller features and Queue Depth?

Cormac Hogan VSAN Part 25 – How many hosts needed to tolerate failures?

Cormac Hogan Components and objects  and What is a witness disk 

 

VMware Virtual SAN Hardware Design Guide released

VMware Virtual SAN (VSAN) uses local server storage and presents it as shared storage. Benefits are lower costs, simplicity, performance and agility. Server hardware which supports Virtual SAN can either be bought (VSAN Ready Nodes) or assembled by the customer. VMware just released a technical whitepaper titled “Virtual SAN Hardware Guidance”. It explains which hardware components should be used when assembling a server yourself. Topics covered in the Virtual SAN Hardware Guidance whitepaper include:

  • Server Form Factors
  • Server Boot Devices
  • Flash Devices
  • Magnetic Hard Disk Drives
  • Storage Controllers
  • Networking

More info in this VMware blog.

VMware releases vCenter Converter Standalone 5.5.1. Adds VSAN support

VMware Converter is a software tool used for the creation of VMware virtual machines from sources like physical servers, other vendors’ virtual machines or disk images. It is free to use.
The VMware vCenter Converter Standalone 5.5.1 is an update release that fixes important issues and adds the following new features:

  • Support for vSAN
  • Support for DSA authentication for Linux conversions

Release notes are here

Download here.

 

 

VMware Virtual SAN (VSAN) is now available for download. Licensed per CPU or user.

VMware VSAN 5.5 is available for download now.

In order to run Virtual SAN, you need to download vSphere 5.5 U1 (or above). There are no additional binaries required. Virtual SAN does require a separate license and is available for a free 60-day evaluation period along with vSphere. Please note that if you plan to use Virtual SAN with VMware Horizon View, you’ll need to download a specific Horizon View version 5.3.1 binary that supports Virtual SAN.

At least three vSphere 5.5 ESXi nodes are required to be able to use VSAN. The maximum number of nodes in a VSAN 5.5 cluster is 32.

As you may have noticed, the first release of Virtual SAN is named ‘Virtual SAN 5.5’. It will be available in two editions and one bundle (a temporary offer).

Here is the official announcement on the VMware company blog about the GA of Virtual SAN 5.5. It includes pricing details listed below.

  • The Standalone edition is licensed per CPU and costs $2,495. It features a persistent data store, read/write caching, policy-based management, virtual distributed switch, replication, snapshots and clones. This edition can run any workload, either virtual servers or desktops.
  • VSAN can also be licensed per user for VDI deployments only (VMware or Citrix), concurrent or named. The cost per user is $50. The features are the same as the per-CPU licensing.
  • There is also a software bundle of Virtual SAN and vSphere Data Protection Advanced (VDPA) which costs $2,875 per CPU. This is a limited offer which expires September 15, 2014. This is a VERY GOOD deal. The list price of VDPA is $1,100, so you save $720 per CPU when purchasing this bundle (see the quick check below).
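A quick sanity check of the bundle arithmetic above, using the list prices as quoted in this post:

```python
# Quick check of the bundle savings quoted above (list prices per CPU).
vsan_per_cpu = 2495
vdpa_per_cpu = 1100
bundle_per_cpu = 2875

bought_separately = vsan_per_cpu + vdpa_per_cpu    # 3595
savings = bought_separately - bundle_per_cpu       # 720 per CPU
print(bought_separately, savings)
```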

There will be some special offers for VSAN:

  • customers using the beta will receive a discount of 20%.
  • customers using VMware Virtual Storage Appliance (VSA) will get reduced pricing when they upgrade to VSAN.

VSAN can be installed on self-assembled hardware components which are listed on the VSAN HCL. Another option is to buy VSAN Ready Node and VSAN Ready Block hardware configurations. A Ready Node recommended configuration is a single pre-configured server for use with Virtual SAN. A Ready Block recommended configuration is a pre-configured set of servers for use with Virtual SAN.

A list of vendors supplying these can be found here.

Download vSphere 5.5 Update 1 which includes VSAN here.

A large collection of links to blogposts on VSAN here and here

VMware released a new VSAN Design and Sizing Guide (edition March 2014) which can be downloaded here.

VMware has a free Hands-on Lab (HOL) available which enables you to play and experiment with VMware VSAN. No need to have hardware, software or licenses. The HOL runs in the cloud.

This is a VSAN license calculator showing costs for vSphere, VSAN, S&S

VMware VSAN will be generally available in March 2014

At VMware Partner Exchange 2014 some interesting details were made public about the features of VMware Virtual SAN (VSAN) and its general availability.

VMware Virtual SAN (VSAN) is a very popular product, even though it has not been released yet. It is currently available as a public beta. Over 10,000 people joined this beta. More on the background of VMware VSAN in my blog here.

VMware is currently offering beta customers a 20% discount on Virtual SAN purchases.  The discount is available to those beta participants who have joined and downloaded the Virtual SAN beta product.

I guess VMware estimates the GA of VSAN will lead to significant growth in sales. VSAN can be used in many customer environments currently running vSphere. It might also boost vSphere license upgrades, as vSphere 5.5 or later is required to run VSAN. It promises to become a pretty disruptive technology.

VSAN turns local server storage (SSD and HDD) into a shared storage solution, providing good performance at lower costs compared to general-purpose storage arrays. Host-based SSD is used for caching; HDD is used for persistent storage of virtual machine hard disk files.

Benefits of VSAN are:

  • Reduce investment costs by using low-cost local storage instead of an expensive SAN
  • Pay-as-you-grow model instead of large upfront investments. If you need more storage capacity, simply add SSDs or HDDs instead of having to buy a new SAN extension.
  • It lowers operational costs because it is simple to use, does not require a storage administrator and has increased automation

Some more details on VSAN became public in the last couple of days thanks to VMware Partner Exchange (PEX). The information below was extracted from a recent blog post by Chuck Hollis.

1. VSAN will be generally available in Q1 2014 (confirmed). On March 6 a VSAN webinar is scheduled, hosted by VMware CEO Pat Gelsinger and CTO Ben Fathi. These executives are likely to announce some surprises. In the past we have seen announcements of new releases of VMware vSphere at similar webinars presented by the CEO. So a good *guess* would be that on March 6 the GA date of VSAN will be announced. Maybe March 6 will be the GA date.

2. VSAN will be made available as a separate stock keeping unit (SKU). This means VSAN is not included in the vSphere license.

3. A VSAN cluster will support at least 16 nodes at GA.

4. VSAN can be installed by customers who buy their own parts like controllers, SSDs and HDDs. It will also be possible to buy preconfigured servers from IBM, Dell and Cisco. These contain all the VSAN-required components which are listed on the VSAN hardware compatibility guide. For example, SanDisk SSDs will be combined with the Dell PowerEdge R720 and PowerEdge T620 servers to power VSAN.

5. Each VSAN node can support up to 35 disk drives (in addition to up to 5 SSD or PCI-e flash devices). A max of 560 spindles in a single VSAN cluster is supported.
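Those per-node and per-cluster maximums line up: with the usual breakdown of up to 5 disk groups of 7 HDDs each per node (an assumption, not stated in the post), the quoted cluster maximum follows directly:

```python
# How the quoted maximums relate (the 5 x 7 disk-group breakdown is an assumption).
disk_groups_per_node = 5
hdds_per_disk_group = 7
max_nodes = 16                                                # supported per cluster at GA

hdds_per_node = disk_groups_per_node * hdds_per_disk_group    # 35 disk drives per node
print(hdds_per_node, max_nodes * hdds_per_node)               # 35, 560 spindles
```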

A lot of questions are being asked about the pricing of VSAN. Pricing has not been announced yet. Licensing will be based on the number of CPU sockets in the nodes that are part of a VSAN cluster (unconfirmed).

Duncan Epping has another summary about VSAN here. 

 

 

VMware releases VSAN Beta refresh update

VMware has released an update for the Virtual SAN (VSAN) beta. The software is currently available as a public beta and can be downloaded by anyone. The general availability date of VSAN has not been made public yet. Pricing, and which vSphere 5.5 editions it will support, is also unknown.

Updated in this Refresh are:

  • AHCI fix. There was an issue with VSAN losing data under certain conditions when certain AHCI controllers were used. This has now been fixed.
  • New Ruby vSphere Console commands, for in-depth analysis of the performance of a Virtual SAN cluster
  • A disk group may now contain a single SSD and up to seven HDDs
  • A set of VSAN PowerCLI cmdlets has been released as a fling from VMware R&D

More information here.
