Failed to bring CSV online after boot of Hyper-V cluster nodes

One of my customers had a power outage today. Because the system administrator was in the office, he shut down the three Hyper-V hosts manually before the UPS battery ran out of power.

The infrastructure has three Windows Server 2008 R2 hosts with Hyper-V enabled. Shared storage is an HP StorageWorks P2000 G3 MSA connected using iSCSI.

When the power was restored, the Hyper-V hosts booted automatically. Three of the four Cluster Shared Volumes came back online without any problems. However, the fourth CSV did not. Trying to bring the volume online manually resulted in a failed status, and Event ID 1069 was shown in the event log of Failover Cluster Manager. That was a problem, as a couple of critical virtual machines were running on this volume before the power failure. Houston, we have a problem.

In the iSCSI initiator, the paths to all four volumes were re-established automatically after the reboot, so the iSCSI fabric seemed to be functional.

In c:\clusterstorage, volume4 (the missing volume) was no longer shown on any of the three Hyper-V nodes.

The Windows event logs did not give much information on a possible cause of this issue, so we decided to run the cluster.exe log /g command. The log file, named cluster.log, is created in %systemroot%\cluster\reports. More info on cluster.exe can be found on Microsoft.com.

For the failed volume, the cluster.log file showed this line: PR reserve failed, status 170. Other errors shown in the log were:
Failed to preempt reservation, status 170
OnlineThread: Unable to arbitrate for the disk. Error: 170
OnlineThread: Error 170 bringing resource online
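When a cluster.log is large, it can help to pull out just the reservation failures. Below is a minimal Python sketch for that, not something we used during the incident. The function name find_reservation_errors and the sample text are made up for illustration, and the patterns only cover the message shapes quoted above. For reference, status 170 is the Windows error ERROR_BUSY (the requested resource is in use).

```python
import re

# Matches the message shapes quoted in this post, e.g.
# "status 170", "Error: 170" and "Error 170 bringing resource online".
STATUS_RE = re.compile(r"(?:status|Error:?)\s+(\d+)", re.IGNORECASE)

# Phrases taken from the log lines quoted in this post.
KEYWORDS = (
    "PR reserve failed",
    "preempt reservation",
    "arbitrate",
    "bringing resource online",
)


def find_reservation_errors(log_text, status=170):
    """Return log lines that report a reservation/arbitration failure with the given status."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if not any(k.lower() in lowered for k in KEYWORDS):
            continue
        codes = {int(code) for code in STATUS_RE.findall(line)}
        if status in codes:
            hits.append(line.strip())
    return hits


# Hypothetical sample built from the messages quoted above; real cluster.log
# lines carry extra prefixes (IDs, timestamps) that are left out here.
SAMPLE = """\
PR reserve failed, status 170
Failed to preempt reservation, status 170
OnlineThread: Unable to arbitrate for the disk. Error: 170
OnlineThread: Error 170 bringing resource online
Some unrelated informational line
"""

if __name__ == "__main__":
    for hit in find_reservation_errors(SAMPLE):
        print(hit)
```

To use it on a real log, read the generated cluster.log into a string and pass it to the function; file handling and encoding are deliberately left out of the sketch.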

Next we ran this PowerShell command on all of the Hyper-V nodes. To find the number to put after -Disk, open Disk Management and look up the disk number of the failed volume.

Clear-ClusterDiskReservation -Disk 4
After that we were able to bring the volume online again.
Thanks to this article, which helped us find the above command and explains the Persistent Reservation error in the cluster.log file.

Cause
I am not sure what caused the problem. It could be that, because all three nodes were booting at the same time, multiple nodes tried to claim ownership of a LUN and corrupted the registration table of the HP P2000 MSA.

I explained Cluster Shared Volumes earlier in a post titled Clustered Shared Volumes explained, design impact and best practises.

HP has an article explaining PR registration.

RESOLUTION: The HP StorageWorks P2000 MSA Disk Arrays count total PR registrations on a system-wide basis until a limit of 1024 is reached. There might be one PR registration for every LUN up to a maximum of 128 per LUN until the 1024 system limit is reached. Reducing the number of PR registrations will eliminate the error and allow path failover to work correctly.
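The limits in HP's note lend themselves to a quick budget check. The Python sketch below is a back-of-the-envelope calculation under one loud assumption that is not in HP's text: that each node registers once per iSCSI path for every LUN it sees. The function name pr_registration_budget and the example figure of two paths per node are hypothetical. Only the limits of 1024 per system and 128 per LUN come from the quoted resolution.

```python
SYSTEM_LIMIT = 1024   # system-wide PR registrations, from the HP note
PER_LUN_LIMIT = 128   # PR registrations per LUN, from the HP note


def pr_registration_budget(nodes, paths_per_node, luns):
    """Estimate PR registrations, ASSUMING one registration per node, per path, per LUN."""
    per_lun = nodes * paths_per_node
    total = per_lun * luns
    return {
        "per_lun": per_lun,
        "total": total,
        "per_lun_ok": per_lun <= PER_LUN_LIMIT,
        "system_ok": total <= SYSTEM_LIMIT,
    }


if __name__ == "__main__":
    # Three nodes and four CSVs as in this post; two paths per node is a made-up figure.
    print(pr_registration_budget(nodes=3, paths_per_node=2, luns=4))
```

With these example numbers the estimate stays well below both limits. The real count depends on how many paths and LUNs are actually presented to the nodes.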

An explanation of reservations can be read in this thread.
Check the source for more information: http://itinfras.blogspot.com/2010/03/storage-architecture-changes-for.html

About Marcel van den Berg
I am a technical consultant with a strong focus on server virtualization, desktop virtualization, cloud computing and business continuity/disaster recovery.

One Response to Failed to bring CSV online after boot of Hyper-V cluster nodes

  1. Reese says:

    You save my life
