Cluster Validation - BEWARE!!!

on June 27, 2020

Subtitle: SHAME ON YOU MICROSOFT!

This is not a new behavior. It’s not a new risk. It’s the same old risk that’s been in Windows Failover Clustering for quite some time. But it’s worse for SQL Server environments that use a Windows failover cluster for an Availability Group – since you are not sharing storage for an AG (unless for some odd reason you wanted to have a quorum drive.)

I’m pretty sure I learned this one the hard way myself, when the task of adding a new node to a Windows cluster fell on me. Everything I had read suggested you should run a validation. Good to make sure all is ready, right? I mean, it’s called validation. That sounds like a safe and mostly good thing. Anyway, I ran it and all of the drives went away. In that case, wayyyy back when, it wasn’t a big deal: it was pre-production and the drives weren’t vital. But my heart stopped beating, my stomach knotted up and my vision narrowed. Time slowed. Yup. An adrenaline dump.

Short story: When you choose the default options in Validation, any storage that could possibly be shared storage is taken offline and brought into the cluster’s Available Storage role. And I mean any storage. So if you run validation on a live Windows Failover Cluster that is hosting, say, an Availability Group, with disks backed by iSCSI, those drives will be sucked away from the nodes in the cluster and brought into Available Storage. Unceremoniously, your drives will all disappear from File Explorer, your SQL Server databases will come crashing down and folks will start calling you asking “Where did the drives go?! Where is my database?! WHAT THE HECK DID YOU DO!!!!!!!” And then when you sort it out and bring stuff back online, you could have corrupt databases, because you didn’t really tell SQL Server, “pssst…. Hey.. I’m taking your drives offline and in no specific order. Good luck!”

I get reminded of the flaw of Microsoft not having a big flashing ARE YOU SURE?!?!?!?!?!? dialog every so often when I see someone in a forum have an oops. Most recently, we were helping a brand new client who was suffering some pretty serious performance issues. (Don’t ask me what I think about hyperconverged architecture for high performance SQL Servers today. Suffice it to say things were better with a move away from it, back to the “old”. In fact, in general, don’t ask me that question unless you can handle the truth. We’ve seen too many scars… Anyway… This is about Microsoft and the one place they still haven’t put a confirmation warning, not hyperconverged infrastructure.)

We started getting things into a good spot and had some plans in place. The infrastructure folks were adding a new node to the Windows cluster, to add to the AG to make a move. All of a sudden everything died. After about 90 seconds of troubleshooting, I announced on the call, “Shoot! Someone is probably doing a validation of the cluster someplace!!!” “Yeah. That was happening.” We got the drives back (removed them all from Available Storage) and got SQL back, and only had “minimal” corruption (TempDB on one of the nodes, a big DB on the other, thankfully only a secondary in the AG. The primary was clean, though a weekend CHECKDB will confirm that…).

So here’s what happens:

The first dialog. No warnings, right? Cool! I’ll just click next, or maybe even Don’t show again (always hated that tick box) and then next.

Cool! No warnings! I mean I could click that link and learn more about this action (and sure. You should be someone who knows a thing or two about clustering. But Microsoft is known for warnings on the silliest of things… Things that are less dangerous..) Oh. And if you click that link. Forget about it on a server that’s locked down and has only IE on it. It brings you to a KB article that starts off reminding you how important validation is. But no huge warning that quickly draws your attention as to the risk you are about to suffer.

Then finally the last screen. I won’t click Next here, because this is a cluster where I need the storage to stay online. But the rest, as they say, is history.

So. For the love of all that’s good and proper, just tick “Run only tests I select” instead if you don’t want to blink all of your storage offline and have Windows clustering say “Yup. That storage is clusterable” or “Nope, it wasn’t” while you crush your production environment.

And then.. Just go ahead and untick storage… Your users won’t thank you. You can just quietly know you protected them from yourself (and the Cluster Validation UI) and you lived another day to not have to ponder the state of your resume.
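
If you would rather script this than trust yourself in the wizard, here is a minimal sketch of the same idea in Python. It only builds a command string for the `Test-Cluster` PowerShell cmdlet and uses that cmdlet's `-Ignore` parameter to skip the Storage category. The helper name, the `confirmed` flag, and the node names are all made up for illustration, and nothing here actually runs the validation. Treat it as the "ARE YOU SURE?" check the wizard lacks, not as an official recipe, and try it against a non-production cluster first.

```python
# Hypothetical helper: builds (but does not run) a Test-Cluster command line.
# Assumption: the "Storage" test category is what takes eligible disks offline,
# so it is skipped unless the caller explicitly opts in AND confirms.

STORAGE_CATEGORY = "Storage"


def build_test_cluster_command(nodes, include_storage=False, confirmed=False):
    """Return a PowerShell command string that validates the given nodes."""
    if not nodes:
        raise ValueError("at least one node name is required")

    # The confirmation prompt the wizard never gives you.
    if include_storage and not confirmed:
        raise RuntimeError(
            "Storage validation can take disks offline on a live cluster; "
            "pass confirmed=True only if you have planned for that."
        )

    # Node names are quoted for PowerShell; names containing quotes are not handled.
    node_list = ",".join(f"'{name}'" for name in nodes)
    command = f"Test-Cluster -Node {node_list}"

    if not include_storage:
        # -Ignore tells Test-Cluster to skip the named tests or categories.
        command += f" -Ignore '{STORAGE_CATEGORY}'"
    return command


# Example with made-up node names:
print(build_test_cluster_command(["sqlnode1", "sqlnode2"]))
```

With the made-up node names above, this prints a `Test-Cluster` line ending in `-Ignore 'Storage'`, which you can review before pasting it into an elevated PowerShell session on a cluster node.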

Article by Mike Walsh

Mike loves mentoring clients on the right Systems or High Availability architectures because he enjoys those lightbulb moments and loves watching the right design and setup come together for a client. He loves the architecture talks about the cloud - and he's enjoying building a Managed SQL Server DBA practice that is growing while maintaining values and culture. He started Straight Path in 2010 when he decided that after over a decade working with SQL Server in various roles, it was time to try and take his experience, passion, and knowledge to help clients of all shapes and sizes. Mike is a husband, and father to four great children and lives in the middle of nowhere NH.

6 thoughts on “Cluster Validation – BEWARE!!!”

AjarnMark

October 28, 2020 at 17:36

Hey, Mike, I think it’s even worse than that. There actually IS a big warning there. Not one with big bold red text, but I read it that way anyway…it says, “Microsoft supports a cluster solution ONLY if complete configuration (servers, networks, AND STORAGE) can pass all tests in this wizard.” (emphasis mine). To me, the inexperienced guy, that reads as YOU BETTER RUN THE VALIDATION ON STORAGE OR ELSE! So they actively encourage you to do the bad thing, with the threat that they may not support you if you don’t.

This is worse than no warning at all.

But thankfully, we have guys like you doing this work for us who have learned the lesson elsewhere. THANKS!
Jay

November 15, 2021 at 00:34

FYI for future readers.

“This can be done by simply creating a new cluster disk from the same storage array, exposing it to all nodes and running all tests against just that disk. This gives you the benefits of running Validate against that type of disk to ensure that it will work while not risking any downtime to production workloads. This can be done by running Validate, selecting all the tests, but keeping any running services or applications online.”

https://techcommunity.microsoft.com/t5/failover-clustering/validating-a-cluster-with-zero-downtime/ba-p/371685
- Mike Walsh
  
  November 15, 2021 at 08:39
  
  This is handy for new storage, sure. I tend to do my big storage validation at the start, before we’re live, or incur the downtime if swapping storage behind the scenes. But a great point, and a great post that you shared there. My main point here is “Do not press that darn validate button without understanding what goes on when you click it” 😉
  
  Thanks for sharing!
Rakesh Prasad

January 7, 2022 at 14:16

It’s painful to be in this situation. We scratched our heads for 3-4 days, verifying everything from the Windows level down to storage, and finally a Microsoft engineer came back with the same comment.
There should be a warning when selecting storage validation for an already-built cluster.
Thanks for the article!!!!!
Adam

January 31, 2023 at 03:46

It’s a trap you find out about in the worst possible way. Microsoft should have long ago added an option to select a specific LUN for validation testing, so that you can select a non-production LUN or one with a test virtual machine that can explode without much trouble.
