Quick Summary
We received an on-call call from a client who was trying to fail over a SQL Server Failover Cluster Instance from one node to the other, but the instance wouldn’t move to the secondary. They tried the steps we had taught them for failing over during patching and other maintenance, but to no avail, so they escalated to us.
Context
These folks are in AWS on EC2 instances, and they use SIOS DataKeeper to present their EBS storage as shared storage. (There are other approaches to accomplish that as well, and each has pros and cons. In our testing at the time we implemented this, EBS volumes properly selected for performance and availability gave more 9s of availability and better latency guarantees than the alternatives, such as FSx, available at the time. SIOS DataKeeper is definitely a tool in our toolbox that we use a lot in different scenarios.)
Investigation and Problem
Upon investigation, we found in the cluster error log that SIOS DataKeeper was preventing the drives from being moved to the second server. While investigating that error in the SIOS DataKeeper interface, I found that data movement between the two servers was paused due to a drive-size inconsistency. The client verified that the drive had been expanded on the primary server over the weekend because they were running low on space. When we looked at the secondary server, we found that the space was available to the drive, but the volume had never been expanded inside Windows to use it.
With SIOS DataKeeper, the drive on the secondary server must always be the same size as or larger than the drive on the primary server. If the primary drive becomes larger than the secondary drive, data could be written to a location on disk that doesn’t exist on the secondary, which would be really bad if there were a failover. So pausing data movement is a safety mechanism.
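The size rule above can be written down as a very small check. This is only a sketch of the rule itself, not of anything DataKeeper exposes: the `MirrorVolume` class, the `undersized_targets` helper, and the sample sizes are all hypothetical, and the byte counts would have to come from whatever inventory or monitoring data you already collect for each node.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class MirrorVolume:
    """One mirrored volume, as seen from both nodes (hypothetical model)."""
    letter: str
    primary_bytes: int
    secondary_bytes: int

    def safe_to_fail_over(self) -> bool:
        # The rule from the text: the secondary must be the same size or larger.
        return self.secondary_bytes >= self.primary_bytes


def undersized_targets(volumes):
    """Return the volumes whose secondary is smaller than the primary."""
    return [v for v in volumes if not v.safe_to_fail_over()]


# Made-up sizes: E: was grown on the primary only.
volumes = [
    MirrorVolume("D", 500 * GIB, 500 * GIB),
    MirrorVolume("E", 750 * GIB, 500 * GIB),
]

for v in undersized_targets(volumes):
    short_by = (v.primary_bytes - v.secondary_bytes) // GIB
    print(f"{v.letter}: secondary is {short_by} GiB smaller than primary")
```

Note that the comparison has to use the volume size Windows reports, not the size of the underlying EBS volume. In this incident the EBS space had been added on both sides, and only the Windows volume on the secondary was left unexpanded.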
When we deploy SIOS DataKeeper, we give some instructions and training on growing space. In this case, it looks like they followed the instructions to add space on both sides, but probably had a quick distraction before actually extending the volume in Windows on the secondary, leaving the space there but not usable. A failover could have been bad, and DataKeeper and Windows Server Failover Clustering made the right call here: no failover until a human helps.
The Fix
We followed up with the client and ensured they understood that the secondary server’s drive must always be the same size as or larger than the primary server’s drive, and we discussed the “almost completed drive growth.”
We even asked them to put a note in the spot where their sysadmins track and pass off active work. We also implemented a process that reads the server logs for any SIOS messages indicating drive-size differences or that data movement has stopped. We have since made that part of our standards for any client using SIOS DataKeeper.
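As a rough illustration of that kind of log check, here is a minimal sketch. It is not our production process, and the patterns and sample lines are placeholders invented for the example, not real DataKeeper message text. In practice you would replace `ALERT_PATTERNS` with the wording or event IDs you actually see in your own event logs, and feed in lines exported from those logs.

```python
import re

# Hypothetical patterns: substitute the message text or event IDs
# that your own servers actually log.
ALERT_PATTERNS = [
    re.compile(r"mirror.*paused", re.IGNORECASE),
    re.compile(r"size (mismatch|inconsisten)", re.IGNORECASE),
]


def find_alerts(log_lines, patterns=ALERT_PATTERNS):
    """Return the log lines that match any of the alert patterns."""
    return [
        line.strip()
        for line in log_lines
        if any(p.search(line) for p in patterns)
    ]


# Made-up sample lines, for illustration only.
sample = [
    "2024-01-06 02:14:10 INFO  service heartbeat ok",
    "2024-01-06 02:15:42 WARN  mirror for volume E paused",
    "2024-01-06 02:15:43 ERROR volume E size mismatch between source and target",
]

for alert in find_alerts(sample):
    print(alert)
```

Keeping the patterns in one list makes it easy to add new wording as you come across it, and an empty result on every run is the state you want.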
Lesson/Takeaway
SIOS is an excellent tool because it allows companies to stay on SQL Server Standard and have HA in a different zone in a cloud environment, and even a warm DR standby in a different region in case of a disaster.
One thing I personally took away was to review this process with clients who use SIOS to ensure they remember that the secondary server’s drives must be the same size or larger than the primary’s at all times. This happened a long time ago, and this check is part of our process now for rollouts.
The Straight Path Team and Skills
I worked with Jeff on this issue, and we used our knowledge of AWS failover clustering and SIOS deployment and troubleshooting to identify and resolve it.