Case of the Week:
Quick Summary
Two of the client’s four SQL Server Failover Cluster Instances (FCIs) did not fail over to the secondary node.
Context
This is a SQL Server 2019 environment with four load-balanced FCIs running on two physical servers, each with 64 CPUs and 4TB of memory, backed by a Pure Storage array. Before this incident, while failovers were still working, I created a new FCI and worked with the System Administrator to set permissions correctly at the OU level so the cluster could create the computer object in Active Directory.
He asked me about two computer objects that had been moved to a container where computer objects are eventually deleted. He has a PowerShell script that moves computer objects that do not appear on the network. Those two computer objects belong to two of the FCIs, so I told him not to delete them and to move them back to the correct OU. We did that, and I was able to create the new FCI with no issue. He modified his script after our conversation.
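I don't have the System Administrator's actual script, but a cleanup script like the one described, with the kind of cluster-aware exclusion he likely added afterwards, might look something like this sketch. The OU path, domain, and 90-day staleness threshold are all assumptions:

```powershell
# Hypothetical sketch of a stale-computer-object cleanup script.
# Cluster name/virtual computer objects (CNOs/VCOs) never "log on" the way
# regular machines do, so they can look stale; excluding anything that
# carries the cluster virtual server SPN keeps them out of the sweep.
Import-Module ActiveDirectory

$staleDate  = (Get-Date).AddDays(-90)                    # assumed threshold
$quarantine = "OU=StaleComputers,DC=contoso,DC=com"      # assumed container

Get-ADComputer -Filter { LastLogonTimeStamp -lt $staleDate } `
               -Properties LastLogonTimeStamp, servicePrincipalName |
    Where-Object { $_.servicePrincipalName -notmatch 'MSClusterVirtualServer' } |
    ForEach-Object {
        # -WhatIf shows what would move without actually moving anything
        Move-ADObject -Identity $_.DistinguishedName -TargetPath $quarantine -WhatIf
    }
```

Run with -WhatIf first; once the exclusion is confirmed to protect the FCI computer objects, the switch can be dropped.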
The Problem
A couple of weeks went by with no interruption, and I needed to do SQL patching. The two FCIs whose computer objects had been moved now failed when I tried to fail them over. Nothing in the logs gave me any hint as to why. I ended up stopping the SQL Server service for those FCIs on the active node so I could patch, and once that was done, I started them back up. I did not have a large enough maintenance window to troubleshoot further; I had to patch all the SQL Servers.
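The post doesn't say exactly how the service was stopped; one cluster-aware way to do it is to take the SQL Server resource offline so the binaries can be patched in place without attempting a failover. A minimal sketch, assuming the FailoverClusters module and a hypothetical role named "SQL Server (SQLFCI1)":

```powershell
# Sketch only: take the SQL Server resource offline on the active node,
# patch, then bring it back. Instance/role names are placeholders.
Import-Module FailoverClusters

# Find the SQL Server resource inside the FCI's role (cluster group)
$sql = Get-ClusterResource | Where-Object {
    $_.ResourceType.Name -eq 'SQL Server' -and
    $_.OwnerGroup -like 'SQL Server (SQLFCI1)'
}

Stop-ClusterResource -Name $sql.Name    # SQL goes offline, no failover attempt
# ... apply the SQL Server patch here ...
Start-ClusterResource -Name $sql.Name   # bring the instance back online
```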
The Investigation
I compared the registry settings on both nodes for the FCIs that failed to fail over against those that had no issues. I checked all the logs, as I described in my blog post asking whether SQL Server DBAs should understand Windows Server Failover Clustering, and nothing gave me a hint as to why it failed. I checked Failover Cluster Manager, and nothing seemed off there. I asked the System Administrator for a screen share so I could look at the two computer objects we had moved and their permissions, and everything looked good.
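The registry comparison and log collection above can be sketched in PowerShell. Node names are placeholders, and this diffs only key paths under the SQL Server hive, which is one reasonable starting point rather than the exact comparison performed:

```powershell
# Sketch: diff SQL Server registry key paths between the two nodes,
# then pull the cluster log for review. NODE1/NODE2 are assumed names.
Import-Module FailoverClusters

$key = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server'

# Enumerate key paths remotely on each node
$node1Keys = Invoke-Command -ComputerName NODE1 -ScriptBlock {
    Get-ChildItem -Path $using:key -Recurse | ForEach-Object { $_.Name }
}
$node2Keys = Invoke-Command -ComputerName NODE2 -ScriptBlock {
    Get-ChildItem -Path $using:key -Recurse | ForEach-Object { $_.Name }
}

# Keys present on one node but not the other
Compare-Object -ReferenceObject $node1Keys -DifferenceObject $node2Keys

# Generate the cluster log (last 60 minutes) from every node
Get-ClusterLog -TimeSpan 60 -Destination 'C:\Temp'
```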
The Fix
During the next patching cycle, I tried the failover again and got the same failure. For the smaller of the two FCIs, I removed the secondary node from the FCI, rebooted it, added it back, and then patched the server. I tested the failover for that one FCI, and IT FAILED OVER, and I failed back and forth just to be sure! I did the same thing for the other FCI, and yes, I patched it again so that the newly installed secondary instance was at the correct patch level. I failed everything over, patched the other node, and load-balanced the instances afterwards.
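For reference, the remove/re-add cycle on the secondary uses SQL Server setup's documented RemoveNode and AddNode actions, and the failover test can be driven from PowerShell. This is a sketch only; the instance name, node names, service accounts, and passwords are placeholders, and AddNode typically needs additional parameters (such as the failover cluster IP addresses) in a real run:

```powershell
# On the secondary node: remove it from the FCI, then reboot
.\setup.exe /q /ACTION=RemoveNode /INSTANCENAME="SQLFCI1" /CONFIRMIPDEPENDENCYCHANGE=0

# After the reboot, add the node back to the FCI
.\setup.exe /q /ACTION=AddNode /INSTANCENAME="SQLFCI1" `
    /SQLSVCACCOUNT="CONTOSO\sqlsvc" /SQLSVCPASSWORD="***" `
    /AGTSVCACCOUNT="CONTOSO\sqlagt" /AGTSVCPASSWORD="***" `
    /CONFIRMIPDEPENDENCYCHANGE=0 /IACCEPTSQLSERVERLICENSETERMS

# Patch the re-added node, then test the failover both ways
Move-ClusterGroup -Name "SQL Server (SQLFCI1)" -Node NODE2
Move-ClusterGroup -Name "SQL Server (SQLFCI1)" -Node NOD1
```

Patching the re-added node before failing over matters because AddNode installs from the original media, so the new secondary starts at RTM (or the slipstreamed level) rather than the current patch level.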
Lesson/Takeaway
Sometimes you have to think outside the box when nothing else makes sense. It was a shot in the dark, but it worked. Now I don’t have to worry about those two instances if something happens to the only node they would run on. Understanding clustering, where to look for issues, and what to rule out gave me the confidence that we were good here as well.