
SQL Server Case of the Week: FCI in AWS with SIOS Won’t Fail Over


Quick Summary

We received an on-call request from a client who was trying to fail over a SQL Server Failover Cluster Instance (FCI) from one node to another, but the instance wouldn’t move to the secondary. They tried the steps we had taught them for failing over during patching and other maintenance, but to no avail, so they escalated to us.

Context

These folks are in AWS on EC2 instances, and they use SIOS DataKeeper to present their EBS storage as shared storage. (There are other ways to accomplish that as well, and each has pros and cons. In our testing at the time we implemented this, EBS volumes that were properly selected for performance and availability gave more 9s of availability and better latency guarantees than options like FSx that were available at the time. SIOS DataKeeper is definitely a tool in our toolbox that we use a lot in different scenarios.)

Investigation and Problem

Upon investigation, we found in the cluster error log that SIOS DataKeeper was preventing the drives from being moved to the second server. While investigating that error in the SIOS DataKeeper interface, I found that data movement between the two servers was paused due to a drive-size inconsistency.

The client verified the drive was expanded on the primary server over the weekend because they were running low on space. When looking at the secondary server, we found that space was available to the drive, but the drive was never expanded to use it inside Windows.

With SIOS DataKeeper, the drive on the secondary server must always be the same size as or larger than the drive on the primary server. If the primary drive becomes larger than the secondary drive, data could be written to a location on disk that doesn’t exist on the secondary, which would be really bad if there was a failover. So pausing the mirror and blocking the failover is a safety mechanism.
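
To make the size rule concrete, here is a minimal sketch of the comparison in Python. It does not query SIOS DataKeeper or Windows. The `VolumePair` structure and the sample sizes are hypothetical, and in practice the byte counts would come from whatever inventory or monitoring tool you already use.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class VolumePair:
    """Size in bytes of one mirrored volume on each node (hypothetical structure)."""
    drive_letter: str
    primary_bytes: int
    secondary_bytes: int


def find_undersized_targets(pairs):
    """Return the pairs whose secondary volume is smaller than the primary."""
    return [p for p in pairs if p.secondary_bytes < p.primary_bytes]


# Made-up sample data for illustration only.
pairs = [
    VolumePair("E", 500 * GIB, 500 * GIB),  # equal sizes: fine
    VolumePair("F", 750 * GIB, 500 * GIB),  # primary grown, secondary not: unsafe
    VolumePair("G", 200 * GIB, 300 * GIB),  # secondary larger: fine
]

for p in find_undersized_targets(pairs):
    print(f"Drive {p.drive_letter}: secondary is smaller than primary")
```

Only the strict "smaller than" case is flagged, because a secondary that is equal to or larger than the primary satisfies the rule.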

When we deploy SIOS DataKeeper, we give some instructions and training on growing space. In this case, it looks like they followed the instructions to add space on both sides, but probably had a quick distraction before actually extending the volume in Windows on the secondary, leaving the space there but not usable. A failover could have been bad, and DataKeeper and Windows Server Failover Clustering made the right call here: no failover until a human helps.

The Fix

We followed up with the client and ensured they understood that the drive on the secondary server must always be the same size as or larger than the drive on the primary server, and we discussed the “almost completed drive growth.”

We even asked them to put a note in the spot where their sysadmins track and pass off active work. We also implemented a process that reads the server logs for any SIOS messages indicating drive-size differences or that data movement has stopped. We have since made that part of our standards for any client using SIOS DataKeeper.
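
As a rough illustration of that kind of log check (not our actual process), the sketch below scans lines of text for phrases that suggest a size mismatch or paused data movement. The patterns and sample lines are invented placeholders, not real SIOS DataKeeper message text. Take the real wording from the events in your own server logs before relying on anything like this.

```python
import re

# Hypothetical patterns: replace with the message text you actually see
# in your own logs. These are NOT real SIOS DataKeeper messages.
ALERT_PATTERNS = [
    re.compile(r"mirror.*paused", re.IGNORECASE),
    re.compile(r"size.*(mismatch|inconsisten)", re.IGNORECASE),
]


def find_alerts(log_lines):
    """Return the lines that match at least one alert pattern."""
    return [
        line for line in log_lines
        if any(pattern.search(line) for pattern in ALERT_PATTERNS)
    ]


# Invented sample lines for illustration only.
sample = [
    "02:00:01 INFO Mirror state: mirroring",
    "02:05:13 WARN Mirror paused for volume F",
    "02:05:14 ERROR Volume size mismatch between source and target",
]

for line in find_alerts(sample):
    print("ALERT:", line)
```

A check like this is only as good as its patterns, so it makes sense to pair it with a direct size comparison rather than rely on log text alone.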

Lesson/Takeaway

SIOS is an excellent tool because it allows companies to stay on SQL Server Standard and have HA in a different zone in a cloud environment, and even a warm DR standby in a different region in case of a disaster. 

One thing I personally took away was to review this process with clients who use SIOS to ensure they remember that the secondary server’s drives must be the same size or larger than the primary’s at all times. This happened a long time ago, and this check is part of our process now for rollouts.

The Straight Path Team and Skills

I worked with Jeff on this issue, and we used our knowledge of AWS failover clustering and SIOS deployment and troubleshooting to identify and resolve it.

Article by Mike Lynn
Mike got his start in computers in college after taking a class about Excel for an accounting major. After that class, he started taking more computer science courses and decided to change majors. After graduation, Mike took a job in Little Rock, AR as a Developer / DBA. The job was working with a .NET 1.1 application, a SQL Server 2000 backend, and Microsoft Access/SSRS as the reporting tools. Mike quickly learned how much he enjoyed working with databases and has never looked back. The parts he enjoys most are helping people solve their pain points with data, whether that is helping with a performance problem or designing a new system to solve a particular need. He also enjoys automating work because it gives the person who was doing the work more time to focus on new business problems. Mike has worked with every major version of SQL Server since 2000, with the majority of his time spent on the 2008 R2, 2014, and 2016 releases of SQL Server.
