SQL Server Blog

Troubleshooting SQL Server Failover Cluster Instance Weirdness

Case of the Week:

Quick Summary

Two of the four clients’ SQL Server Failover Cluster Instances (FCIs) did not fail over to the secondary node.

Context

This is a SQL Server 2019 with four load-balanced FCI’s running on two physical servers with 64 CPUs, 4TB of memory, and a Pure Storage array on the backend. Before this failover, when the failovers did work, I created a new FCI and was working with the System Administrator to set permissions correctly at the OU level so the cluster could create the computer object in Active Directory.

He asked me about two computer objects that were moved to a container where computer objects would eventually be deleted. He has a PowerShell script that moves computer objects that do not exist on the network. The two computer objects are used with two of the FCIs so I told him not to delete them and move them to the correct OU. We did that and I was able to create the new FCI with no issue. He did modify his script after our conversation.

The Problem

A couple of weeks have gone by with no interruption, and I need to do SQL patching. Those two FCIs, where the computer objects were moved, now fail when I try to do a failover. Nothing in the logs to give me any hint as to why. I ended up stopping the SQL Service for those failed FCIs on the active node so I could do patching. Once that was done, I turned them back on. I did not have a large maintenance window to troubleshoot this further. I had to patch all SQL Servers.

The Investigation

I compared the registry settings on both nodes for the failed failover FCIs and for those that had no issues. I checked all the logs like I blogged about in the blog post asking if SQL Server DBAs should understand Windows Server Failover Clustering, and nothing gave me a hint as to why it failed. I checked Failover Cluster Manager, and nothing seemed off there. I asked the System Administrator to do a screen share so I could look at the computer objects and permissions for those two computer objects that we moved, and everything looked good.

The Fix

During the next patching cycle, I tried the failover again and same failure. For the smaller of the two FCIs, I removed the secondary from the FCI, rebooted, and added it back, then patched the server. I tested the failover for that one FCI, and IT FAILED OVER, and I failed back and forth just to be sure! I did the same thing for the other FCI, and yes, I patched it again so that the newly installed secondary instance had the correct patch level. I failed over everything and patched the other node, then load-balanced the instances afterwards.

Lesson/Takeaway

Sometimes you have to think outside the box when nothing else makes sense. It was a shot in the dark, but it worked. Now I don’t have to worry about those two instances if something happens to the only node, it would run on. Understanding clustering and where to look for issues and what to rule out, helped me have the confidence that we were good here as well.

Article by Sandra Delany

Sandra started with Straight Path in April as a Senior SQL Server Consultant. She started her career supporting Local Area Networks, troubleshooting network, hardware and software problems. She decided to take classes learning Oracle and got certified. Around that same time, she had the opportunity to learn SQL Server and stayed with it ever since. She has many many years of SQL Server database experience as an application DBA/developer and production DBA in the private and public sector supporting multiple clients. She has experience with SQL Server 7.0-2019 (SSIS, SSRS, SSAS) and Visual Studio 2005/2008/2010. She recently received her Microsoft Certified Professional Cloud Platform certification.

Subscribe for Updates

Name

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Share This