SQL Server Blog

Should a SQL Server DBA Know Windows Clustering?

Should a SQL Server DBA know how a Windows cluster works, and or how to create a Windows cluster, or troubleshoot a cluster? Or should we, as DBAs, stay in our lane? In some organizations, a line is drawn between what a DBA can and can’t do and the System Administrator (SA) has the Infrastructure roles and responsibilities. This is fine, but that doesn’t mean a DBA shouldn’t know how to troubleshoot a FCI (Failover Cluster Instance) or AG (Availability Group) issue after an unplanned cluster-level failover.  (Yes – you can create an AG without a cluster, but that will not be taken into consideration here.)

You may even find that the SA does not know how to build a cluster. I have seen this time and time again. It’s not that they are not capable; it could be that they’ve never had the opportunity to do it, or don’t do it often enough, so they need help creating one. When creating a brand-new cluster and you or the SA mess up, it is not a big deal. You can destroy it (yes, that is what you do when wanting to drop a cluster) and try again.

Other tasks are not as easy to fix. For example, adding a node into an existing cluster can be challenging and must be done correctly, or you can inadvertently cause your disks to turn into a RAW state. This is bad, as in “you just brought the environment down hard” (hope it wasn’t production!). When this happens, opening a call with Microsoft to help resolve the disk issue will be required, or, if you are lucky, the SA might know how to fix it, or you can use AI tools to help troubleshoot or fix it. Regardless of how the disk issue is fixed, this must be done before the cluster and AG or FCI can be restarted.

Over the years, I’ve learned how to do all the above and feel comfortable completing tasks that are not always typical for a DBA. Not every DBA or Systems Administrator has that knowledge, which is OK. But at a minimum, you as a DBA should be able to troubleshoot why an FCI or AG had an unplanned failover or had issues, so you can communicate that to the SA or Infrastructure team.

If you do not have access to the cluster in Failover Cluster Manager, ask the System Administrators to grant you access. It is hard to do your job if you can’t see what is going on with the cluster. Is the cluster started? Are key resources necessary for the cluster offline that can be brought online without errors? Are the nodes happy? Hopefully, one node is not quarantined. Is the role started, with its dependencies started?

When I do a cluster review, I gather as much information on that cluster, FCI, AG build, as I can. If something goes south (ransomware, hardware issues, VMs deleted, etc.), having this information documented will allow you to rebuild quickly. We had a client that was hit with ransomware, and we were able to quickly rebuild the servers with the information we had gathered during one of my cluster reviews, since the SAs did not have cluster or listener names/IPs, quorum information, or IPs for the nodes.

Document the Windows Failover Cluster and SQL Server Availability Group or Failover Cluster Instance Settings

Some people hate writing documentation, but others love it. I am in the ‘loving it’ category. Even if you did not build the cluster and AG/FCI, you, as a DBA, should familiarize yourself with how it is built. Document what you see.

  • How is the cluster built?
  • How many nodes?
    • What are all the IP addresses of the node(s)
    • If multiple NICs on each node, is the binding order correct (most servers only have one NIC but there could be multiples)
  • Is quorum set? Should it exist?
  • Is a node set to not have a vote?
    • If so, try to find out why – it could be a simple as it was turned off due to location compared to the other nodes in the cluster.
  • Are there settings that are not default? If not that is good, but if set too high ask someone why.
    • i.e. lease timeout (20000), healthcheck timeout (30000), max failures in a specified period (n-1 where n is the number of nodes in the cluster), failback settings, multi-subnets.
  • Is hostrecordTTL (1200) and RegisterAllProviders (1) changed from default?
  • Are there SPNs created, etc. etc.

If something seems off, ask why it was set that way. If using quorum, is it disk, fileshare, or cloud witness? If fileshare, does the destination still exist and are the permissions set correctly at the share and security level? If cloud witness, do you have the information in the event you need to set it back up or rebuild, and do you want to use the same location? I honestly do not like disk quorums, I think fileshare or cloud witnesses are the way to go. If you have enough nodes in a cluster, you do not necessarily need quorum. For an AG, in Active Directory, does the cluster have full control of the listener? If not, it should.

Troubleshooting Clusters for SQL Server DBAs

When troubleshooting issues, review the SQL error log, the event viewer (system log), and the cluster log. Make sure you look at the other nodes that are part of the cluster as well, especially if you are having a hard time trying to find out what the issue is. Be careful and do not go down a rabbit hole. What I mean by this is, you may see something during or right before the outage and think that is the problem, but it might be a false alarm. Check to see if you see that same error at other times in the log when there wasn’t an issue to make sure that it is indeed the problem or just an issue that happened to show up when the issue occurred.

If you see errors in the event viewer that happen all the time that a SA needs to fix, let them know. It makes life easier when there are not a lot of entries in the system log that should have been corrected.

At Straight Path, we created a job that parses the system log looking for FailoverClustering events, which then causes the job to fail if it sees that entry. This prompts our support team to create a ticket so we can look at system logs and see if anything needs to be fixed before it causes a P1 outage. At one of my clients, the fileshare witness server was decommissioned. We were not told that server was decommissioned. FailoverClustering events were in the logs, but it did not cause an issue until one of the nodes that was part of the cluster was patched then rebooted which brought ALL the FCIs to go down on the active node. This issue prompted us to have a job be created to monitor the system log.

I always look at the logs first and follow whatever path they lead. However, you do not need to look up every entry in the cluster log. When you look at the cluster log for the first time, it can be very overwhelming. There is a lot of information that may not make sense to you.

If you tried to understand everything in that log, it might take you months to get through it and you still might not understand everything. I, for one, do not understand every little thing, but I understand enough to find the issue if the issue is at the cluster level. Zeroing in on the correct date\time will save your sanity.

The first time I looked at a cluster log, I was like “nope, not going to do it,” and looked at SQL error and event logs to help me troubleshoot an issue. Eventually, I did spend time looking at them and trying to understand what each section meant. Looking at those cluster logs when you are not dealing with a P1 issue is the best time to try and understand what each section tells you. Remember you “should” look at all cluster logs, since one log might not show the issue since that server did not have the problem when the log on the other server did.

However, if you use the local time parameter, you can easily find the time & date of the failover and times will match in the SQL Server logs and system log.

Get cluster log

Run a PowerShell script to output the cluster log on one of the nodes that is part of the cluster. When you run the command, it will create a log on ALL nodes that are part of the cluster. Look at the cluster log on all nodes that are part of the cluster not just one. One node may show issues that caused the actual problem that might not show up on the other node(s). You should not assume it is the same on each node.

Cluster Log parameters I think are useful for DBAs

URL to see other parameters you can use – https://docs.microsoft.com/en-us/powershell/module/failoverclusters/get-clusterlog?view=windowsserver2022-ps

Use local time

By default, the log will be created using Greenwich Mean Time (GMT). If you want to see the times that are server times, then add the parameter -uselocaltime. This will make troubleshooting easier since times will use server time.

PS> get-clusterlog -useLocalTime

Use time span

When trying to determine how far to go back, do not pick a low number of minutes. You want to make sure that “error” is indeed the error and not just something that happens all the time. Avoid the rabbit hole.

7200 – 5 days

4320 – 3 days

1440 – 1 day

360 – 6 hrs

PS> get-clusterlog -uselocaltime -TimeSpan 7200

What does the cluster log tell you?

The default location of the cluster log is c:\windows\cluster\reports\cluster.log. You can decide to put the log locally for each node if you want by using parameter -Destination. I don’t tend to use that destination parameter since I forget that it exists.

Cluster log sections

There are several different sections in the log.  The ones I tend to look at are as follows. To make things easier and avoid scrolling, I search by date and time. That gets me to the correct location quickly especially if the log is long.

[=== Cluster ===]

Provides the node name where the cluster came from. This is good if you have several cluster logs open at the same time.

[=== System ===]

This section also provides information about what is going on. Just because a node is removed from the cluster due to communication issues doesn’t necessarily mean something happened to the AG or FCI. I have seen servers get removed out of the cluster then added back but SQL Server and the cluster were happy.

Example of password change (not an issue) and node being removed from the cluster due to lost communication (in this instance it was an issue):

[=== Microsoft-Windows-FailoverClustering/Operational logs ===]

This section tells you what happened with the role (AG name or FCI) failover. This helps with timing when looking at SQL Logs and the system log, and searching the cluster log based on date\time. You can see communications issues, etc.

Example of communication issue:

[=== Cluster Logs ===]

Gives more in-depth information on what happened. This section will also tell you about communication issues. When you see an issue, make sure that same error is not happening when things are fine.

Example of communication issue on port 3343 which is the port the cluster communicates on:

Common Issues

I have seen third-party tool snapshots at the disk level cause issues within the AG. Bumping up cluster settings did not help. I used SQLSentry to see that the disks were locked during that snapshot which caused the AG to lose connections and caused a connection timeout which of course kicks everyone out of the AG so it could then re-establish the connections. I had the client stop doing snapshots of the servers. I verified they were backing up the backups which was critical. Mike blogged about the “fun” we see when a client deploys SentinelOne without asking us first. It’s not just snapshots, by the way, use that post from Mike and check your AV setings for your clusters hosting your SQL Server AGs and FCIs.

When dealing with AGs or FCI unplanned failovers, you should make sure SQL Server is up and running, and happy. For an AG, make sure the dashboard is green then check if SQL Logs gives you a clue on why it occurred. You should also look at the cluster log and event viewer (system log) as well to try to determine why there was an issue. Was it a communication issue between the nodes? Was it a lease timeout, connection timeout, network blip which caused communication issues, all cluster and SQL AG setting set to defaults which brought communication issues to the forefront, disk issues, etc.? If your cluster is not stable (communication issues), you need to find out why so you can let the network team know. 

An offline cluster does not always cause problems within an AG. I have seen where the cluster was down (offline), but the AG was happy until I wanted to add DBs into the AG, it failed due to the cluster being offline. I had to start the cluster (online it) then I was able to add user databases into the AG with no issues.

Conclusion

I believe SQL Server DBAs can and should understand how to troubleshoot Availability Group and Failover Cluster Instance issues at the Windows Clustering level. If a cluster is not stable, you may be spinning your wheels trying to find the problem at the SQL level when the SQL issue was caused at the cluster level due to a communication issue at the network level. SQL Server tends to be blamed for all issues until you prove otherwise.

Article by Sandra Delany
Sandra started with Straight Path in April as a Senior SQL Server Consultant. She started her career supporting Local Area Networks, troubleshooting network, hardware and software problems. She decided to take classes learning Oracle and got certified. Around that same time, she had the opportunity to learn SQL Server and stayed with it ever since. She has many many years of SQL Server database experience as an application DBA/developer and production DBA in the private and public sector supporting multiple clients. She has experience with SQL Server 7.0-2019 (SSIS, SSRS, SSAS) and Visual Studio 2005/2008/2010. She recently received her Microsoft Certified Professional Cloud Platform certification.

Subscribe for Updates

Name

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Share This