Failed to bring CSV online after boot of Hyper-V cluster nodes
April 26, 2012
One of my customers had a power outage today. Because the system administrator was in the office, he shut down the three Hyper-V hosts manually before the UPS batteries ran out of power.
The infrastructure has three Windows Server 2008 R2 hosts with Hyper-V enabled. Shared storage is an HP StorageWorks P2000 G3 MSA connected using iSCSI.
When the power was restored, the Hyper-V hosts booted automatically. Three of the four Cluster Shared Volumes came back online without any problems. However, the fourth CSV was not online. Trying to bring the volume online manually resulted in a failed status. Event ID 1069 was shown in the event log of Failover Cluster Manager. Mmm, a problem, as a couple of critical virtual machines were running on this volume before the power failure. Houston, we have a problem.
In the iSCSI initiator, the paths to all four volumes were re-established automatically after the reboot. So the iSCSI fabric seemed to be functional.
Looking in c:\clusterstorage, volume4 (the missing volume) was no longer shown on any of the three Hyper-V nodes.
The Windows event logs did not give much information on a possible cause of this issue. So we decided to run the cluster.exe log /g command. The log file, named cluster.log, is created in %systemroot%\cluster\reports. More info on cluster.exe on Microsoft.com
The cluster.log file showed this line for the failed volume: PR reserve failed, status 170. Other errors shown in the log were:
Failed to preempt reservation, status 170
OnlineThread: Unable to arbitrate for the disk. Error: 170
OnlineThread: Error 170 bringing resource online
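For reference, Win32 error 170 is ERROR_BUSY (the requested resource is in use). Finding these lines by hand in a large cluster.log is tedious, so here is a minimal sketch of how the search could be done from PowerShell. It assumes the log sits in the default location mentioned above; the search patterns are simply taken from the error lines shown here and may need adjusting for other logs.

```powershell
# Generate cluster.log on the nodes (the same command as used above).
cluster.exe log /g

# Assumed default location of the generated log on this node.
$log = Join-Path $env:SystemRoot 'Cluster\Reports\Cluster.log'

# Show every line mentioning status 170 or error 170, with its line number.
Select-String -Path $log -Pattern 'status 170', 'Error:? 170' |
    Select-Object LineNumber, Line
```

Running this on each node shows whether every node hits the same reservation failure or only some of them.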
What we did next was run this PowerShell command on all of the Hyper-V nodes. To find the number that goes after Disk, use Disk Management and look up the disk ID of the failed volume.
I am not sure what caused the problem. It could be that, because all three nodes were booting at the same time, multiple nodes tried to claim ownership of a LUN and corrupted the registration table of the HP MSA 2000.
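A minimal sketch of such a command, assuming it is the Clear-ClusterDiskReservation cmdlet from the FailoverClusters module, which takes the disk number after -Disk. The disk number 4 below is only a placeholder; use the number that Disk Management shows for the failed volume on the node where you run it.

```powershell
# Load the failover clustering cmdlets (Windows Server 2008 R2 and later).
Import-Module FailoverClusters

# Placeholder disk number: replace 4 with the disk number from Disk Management.
$diskNumber = 4

# Clear the persistent reservation on that disk.
# -Force suppresses the confirmation prompt.
Clear-ClusterDiskReservation -Disk $diskNumber -Force
```

Clearing a reservation on a disk that another node is actively using can cause data corruption, so only do this for a disk that fails to come online, and try to bring the volume online again afterwards.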
I explained Cluster Shared Volumes earlier in a post titled Clustered Shared Volumes explained, design impact and best practises.
HP has an article explaining PR registration.
RESOLUTION The HP StorageWorks P2000 MSA Disk Arrays count total PR registrations on a system-wide basis until a limit of 1024 is reached. There might be one PR registration for every LUN up to a maximum of 128 per LUN until the 1024 system limit is reached. Reducing the number of PR registrations will eliminate the error and allow path failover to work correctly.
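To get a feel for these limits, here is a rough back-of-the-envelope check in PowerShell. Only the two limits (128 per LUN, 1024 system-wide) come from the HP text above. The number of paths per node is a hypothetical value, and the assumption that every host path to a LUN holds one registration is mine, not HP's.

```powershell
# Hypothetical inputs; only the two limits come from the HP article.
$nodes        = 3   # Hyper-V hosts in the cluster
$pathsPerNode = 2   # assumed number of iSCSI paths per host to each LUN
$luns         = 4   # Cluster Shared Volumes

# Assumption: one PR registration per host path per LUN.
$perLun = $nodes * $pathsPerNode
$total  = $perLun * $luns

"Registrations per LUN: $perLun (limit 128)"
"Registrations system-wide: $total (limit 1024)"

if ($perLun -gt 128 -or $total -gt 1024) {
    "Over the limit"
} else {
    "Within the limits"
}
```

With numbers like these a small cluster stays far below the limits, which fits with the idea that stale or corrupted registrations, rather than the sheer count, were the problem here.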
An explanation of reservations can be read in this thread.
Check the source for more information http://itinfras.blogspot.com/2010/03/storage-architecture-changes-for.html