Subtitle: SHAME ON YOU MICROSOFT!
This is not a new behavior. It’s not a new risk. It’s the same old risk that’s been in Windows Failover Clustering for quite some time. But it’s worse for SQL Server environments that use a Windows failover cluster for an Availability Group – since you are not sharing storage for an AG (unless for some odd reason you wanted to have a quorum drive.)
I’m pretty sure I learned this one the hard way myself when the task of adding a new node to a Windows cluster fell on me. Everything I had read suggested you should run a validation. Good to make sure all is ready, right? I mean, it’s called validation. That sounds like a safe and mostly good thing. Anyway, I ran it and all of the drives went away. In this case, wayyyy back when, it wasn’t a big deal: it was pre-production and the drives weren’t vital – but my heart stopped beating, my stomach knotted up and my vision narrowed. Time slowed. Yup. An adrenaline dump.
Short story: When you choose the default options in Validation – any storage that could possibly be shared storage is taken offline and brought into the cluster’s available storage role. And I mean any storage. So if you run validation on a live Windows Failover Cluster that is hosting, say, an Availability Group, with disks backed by iSCSI – those drives will be sucked away from the nodes in the cluster and brought into the available storage. Unceremoniously, your drives will all disappear from you in File Explorer, your SQL Server databases will come crashing down and folks will start calling you asking “Where did the drives go?! Where is my database?! WHAT THE HECK DID YOU DO!!!!!!!” And then when you sort it out and bring stuff back online – you could have corrupt databases, because you didn’t really tell SQL Server, “pssst…. Hey.. I’m taking your drives offline and in no specific order. Good luck!”
I get reminded of the flaw of Microsoft not having a big flashing ARE YOU SURE?!?!?!?!?!? dialog every so often when I see someone in a forum have an oops. Most recently, we were helping a brand new client who was suffering some pretty serious performance issues (don’t ask me what I think about hyperconverged architecture for high performance SQL Servers today.. Suffice it to say things were better with a move away from it, back to the “old.” In fact, in general, don’t ask me that question unless you can handle the truth.. We’ve seen too many scars… Anyway… This is about Microsoft and the one place they still haven’t put a confirmation warning, not hyperconverged infrastructure….). We started getting things into a good spot and had some plans in place. The infrastructure folks were adding a new node to the Windows cluster, to add to the AG, to make a move. All of a sudden everything died. It took me about 90 seconds of troubleshooting. “Shoot! Someone is probably doing a validation of the cluster someplace!!!” I announced on the call. “Yeah. That was happening.” We got the drives back (removed them all from available storage) and got SQL back, and only had “minimal” corruption (TempDB on one of the nodes, a big DB on the other – thankfully only a secondary in the AG. Primary was clean. Though a weekend checkdb will confirm that…)
So here’s what happens:
So. For the love of all that’s good and proper – just tick “Run only tests I select” instead if you don’t want to blink all of your storage offline and have Windows clustering say “Yup. That storage is clusterable” or “Nope, it wasn’t” while you crush your production environment.
And then.. Just go ahead and untick storage… Your users won’t thank you. You can just quietly know you protected them from yourself (and the Cluster Validation UI) and you lived another day to not have to ponder the state of your resume.
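If you’d rather script it than trust yourself with the wizard, the same “untick storage” idea can be sketched from the command line. This is only a rough sketch, not a blessed procedure: it assumes the FailoverClusters PowerShell module’s Test-Cluster cmdlet and its -Ignore parameter are available on the box, the cluster name below is made up, and the helper function is my own invention for illustration. It only builds and prints the command; it does not run anything against your cluster.

```python
def build_validation_command(cluster_name, include_storage=False):
    """Build the argv list for a PowerShell Test-Cluster run.

    Hypothetical helper for illustration. By default the Storage test
    category is skipped via -Ignore, mirroring "untick storage" in the
    wizard. Assumes Test-Cluster accepts -Cluster and -Ignore.
    """
    # Double any single quotes so the name is safe inside a
    # single-quoted PowerShell string.
    safe_name = cluster_name.replace("'", "''")
    script = f"Test-Cluster -Cluster '{safe_name}'"
    if not include_storage:
        # Skip the storage tests, the ones that take disks offline.
        script += " -Ignore 'Storage'"
    return ["powershell.exe", "-NoProfile", "-Command", script]


if __name__ == "__main__":
    # 'SQLCLU01' is a made-up cluster name.
    print(" ".join(build_validation_command("SQLCLU01")))
```

The point of the default is that you have to go out of your way (include_storage=True) to get the dangerous behavior, which is the opposite of how the wizard treats you.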
Hey, Mike, I think it’s even worse than that. There actually IS a big warning there. Not one with big bold red text, but I read it that way anyway…it says, “Microsoft supports a cluster solution ONLY if complete configuration (servers, networks, AND STORAGE) can pass all tests in this wizard.” (emphasis mine). To me, the inexperienced guy, that reads as YOU BETTER RUN THE VALIDATION ON STORAGE OR ELSE! So they actively encourage you to do the bad thing, with the threat that they may not support you if you don’t.
This is worse than no warning at all.
But thankfully, we have guys like you doing this work for us who have learned the lesson elsewhere. THANKS!
FYI for future readers.
“This can be done by simply creating a new cluster disk from the same storage array, exposing it to all nodes and running all tests against just that disk. This gives you the benefits of running Validate against that type of disk to ensure that it will work while not risking any downtime to production workloads. This can be done by running Validate, selecting all the tests, but keeping any running services or applications online.”
https://techcommunity.microsoft.com/t5/failover-clustering/validating-a-cluster-with-zero-downtime/ba-p/371685
This is handy for new storage sure – I tend to do my big storage validation at the start before we’re live. Or incur the downtime if swapping storage behind the scenes. But a great point, great post that you shared there. My main point here is “DO not press that darn validate button without understanding what goes on when you click it” 😉
Thanks for sharing!
It’s painful to be in this situation. We scratched our heads for 3-4 days, verifying everything from the Windows level down to storage, and finally a Microsoft engineer came back with the same comment.
There should be a warning when selecting storage validation on an already built cluster.
Thanks for the article!!!!!
It’s a trap you find out about in the worst possible way. Microsoft should have long ago added an option to select a specific LUN for validation testing, so that you can select a non-production LUN or one with a test virtual machine that can explode without much trouble.