Tuesday, June 24, 2014

RDM LUNs on an MSCS SQL Cluster

I had an outage on my production MSCS SQL cluster last year in which the primary node in the cluster lost its storage. Following some simple troubleshooting steps, we shut down the passive node (it was not able to start the cluster), then the primary node, and restarted each of them individually. This brought the cluster back. Looking for a root cause, I went through the VMware logs and noticed several issues with the RDMs:

Sep 10 12:52:11 vmkernel: 47:02:52:19.382 cpu17:9624)WARNING: NMP: nmp_IsSupportedPResvCommand: Unsupported Persistent Reservation Command,service action 0 type 4

I was seeing this message on all hosts that had visibility to the RDM LUN. During a maintenance window, I decided to update drivers and firmware and noticed that the hosts were taking an extremely long time to start ESXi after a reboot.

Investigating this further, I looked at the logs during the restart and saw messages similar to:


Sep 13 22:25:56 p-esx-01 vmkernel: 0:00:01:57.828 cpu0:4096)WARNING: ScsiCore: 1353: Power-on Reset occurred on naa.########################


which led me to believe that it was an issue with the RDM, based on the naa.####### identifier.
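To see which devices are involved, the warnings can be reduced to a list of distinct device IDs. The following is a minimal sketch, not something from the original troubleshooting: the function name reset_devices is made up for illustration, and it only assumes that the log lines look like the one quoted above.

```shell
# Read vmkernel log text on stdin and print each distinct device ID that
# appears in a "Power-on Reset occurred on <device>" warning.
# reset_devices is a hypothetical helper name, not a VMware tool.
reset_devices() {
    # sed -n ... /p prints only the lines where the substitution matched,
    # leaving just the device ID; sort -u removes duplicates.
    sed -n 's/.*Power-on Reset occurred on \([^ ]*\).*/\1/p' | sort -u
}
```

It could be run as `reset_devices < /var/log/vmkernel.log`, where the log path is an assumption that depends on the ESX/ESXi version and syslog setup. Copying the log off the host and running it on a workstation works as well.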

Looking into the VMware KB, I found this article, which was relevant to my situation since I had originally started the cluster on ESXi 4.1, upgraded to 5.0, and eventually to 5.1 U1:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016106

To fix this issue, I had to run the following command on every host that had visibility to the RDM LUN (replace naa.########### with your specific LUN's naa number):


esxcli storage core device setconfig -d naa.################# --perennially-reserved=true

After this change, the hosts restarted quickly rather than taking 5-10 minutes.
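If a host sees more than one RDM LUN, the setconfig command has to be repeated for each device. A small wrapper along these lines may save some typing. It is a sketch only: the function name and the ESXCLI override variable are assumptions added for illustration, and the esxcli arguments are the same ones shown in the command above.

```shell
# Run the setconfig command from this post for every device ID given as an
# argument. mark_perennially_reserved and ESXCLI are hypothetical names.
# Setting ESXCLI=echo prints the commands instead of running them (dry run).
mark_perennially_reserved() {
    esxcli_bin="${ESXCLI:-esxcli}"
    for dev in "$@"; do
        # Stop at the first failure so a mistyped device ID is noticed.
        "$esxcli_bin" storage core device setconfig -d "$dev" \
            --perennially-reserved=true || return 1
    done
}
```

For example, `ESXCLI=echo mark_perennially_reserved naa.aaa naa.bbb` (placeholder IDs) prints the two commands without changing anything. Afterwards, `esxcli storage core device list -d <device>` should report the device as perennially reserved; check the KB article for the exact output on your ESXi version.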
