SQL Server Blog

SQL Server Partition Management Job Failure After Migration to Availability Group

This is the first post in our new Case of the Week series — short write-ups from the real SQL Server issues our team of 17 SQL Server experts helps clients solve every day.

Case of the Week: SQL Server Partition Job Failure After Migration from AWS FCI to AWS AG

Quick Summary

A client noticed one of their reporting tables wasn’t logging any new information after the first of the new month.

Context

This environment ran SQL Server 2019 in an Always On Availability Group configuration hosted on AWS EC2 instances, roughly 30-45 days after the servers were migrated from a SQL Server Failover Cluster Instance, also on EC2, to the new AG setup.

The Problem

Before the missing data was discovered, we had reported to the client that a SQL Server Agent job responsible for partition management had failed. We caught this through our daily checks and SQL Server alerts. The client took action and found the failure was due to a folder permission issue, one that hadn't been carried over to the new servers during the migration process.

The permissions were corrected, and the job succeeded on its next run. Since there were no more errors, it was assumed the problem was fixed. A week later, though, an end user reported that their dashboards weren't showing new data. That's when the DBA, application, and development teams came together to investigate, and our Senior DBAs were brought in to assist with the troubleshooting.

The Investigation

On the SQL Server side, we reviewed:

  • SQL Agent job history – tracing what happened and when, and confirming the actual status of each step and the code the scripts were running.
  • Partitioned table structures – ensuring the partitions were intact and that the job was attempting the right operations.
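As a rough sketch of the job-history review, a query like the following pulls recent runs and step outcomes from msdb (the job name here is hypothetical):

```sql
-- Recent run history for the partition maintenance job (job name is hypothetical)
SELECT j.name,
       h.step_id,
       h.step_name,
       h.run_status,  -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
       msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_datetime,
       h.message
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
  ON h.job_id = j.job_id
WHERE j.name = N'Partition Maintenance'
ORDER BY run_datetime DESC;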

Our team verified the partition metadata was intact and able to receive new data; next we started exploring the database schema used for the partition-switching tables. While we researched, we asked that a developer at the client who was familiar with the table join the call. Once they were on, we asked them to review the application logs to see whether the application was reporting issues before the data ever appeared in SQL Server.
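A partition-metadata check along these lines can confirm each partition's row count, boundary value, and filegroup; the table name below is hypothetical:

```sql
-- Verify partition boundaries and filegroups for a partitioned table
-- (table name is hypothetical; boundary alignment depends on RANGE LEFT/RIGHT)
SELECT p.partition_number,
       p.rows,
       prv.value AS boundary_value,
       fg.name   AS filegroup_name
FROM sys.partitions AS p
JOIN sys.indexes AS i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes AS ps
  ON ps.data_space_id = i.data_space_id
JOIN sys.destination_data_spaces AS dds
  ON dds.partition_scheme_id = ps.data_space_id
 AND dds.destination_id = p.partition_number
JOIN sys.filegroups AS fg
  ON fg.data_space_id = dds.data_space_id
LEFT JOIN sys.partition_range_values AS prv
  ON prv.function_id = ps.function_id
 AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID(N'dbo.ReportingTable')
  AND i.index_id IN (0, 1)  -- heap or clustered index only
ORDER BY p.partition_number;
```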

As they reviewed their logs on screen, we spotted an error almost immediately: the database file for the current partition was not available, and data could not be written to the database.

The Fix

We verified that the folder permissions were correct, but confirmed that the data file for the new partition had never been created.

We manually created the missing database file and associated it with the correct filegroup. Once that was done, data began flowing again normally.
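The fix itself is a single statement; here is a sketch with hypothetical database, file, path, and filegroup names:

```sql
-- Create the missing data file and attach it to the month's filegroup
-- (all names, sizes, and the path are hypothetical)
ALTER DATABASE ReportingDB
ADD FILE
(
    NAME = N'ReportingDB_CurrentMonth',
    FILENAME = N'E:\SQLData\ReportingDB_CurrentMonth.ndf',
    SIZE = 1024MB,
    FILEGROWTH = 512MB
)
TO FILEGROUP [FG_CurrentMonth];
```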

Lesson/Takeaway

Proper alerting and logging could have caught this much sooner:

  • The application had been logging “missing file” errors for days, but no one was alerted.
  • The SQL Agent partition management job could be enhanced to validate the configuration state of partitioning and fire an alert if something is off.
  • The partition management code that rolls the date window forward could have been wrapped in a transaction, ensuring that either all of it happens or none of it does. Wrapping that in a TRY/CATCH block and raising an error of sufficient severity would also have alerted everyone to the real problem.
  • It’s easy to see a job fail, see it run again successfully, and assume all must be good. Ninety percent of the time that’s probably true, but when a job rarely fails, a deeper dive can be worthwhile. That requires a rare commodity for DBA teams these days, though: time.
  • When migrating to new infrastructure, even a lift and shift from AWS to AWS, it’s easy for things outside of SQL Server to be missed. Develop a checklist that covers items like environment variables, file permissions, and dependencies outside of the database. A proper migration practice can help here.
  • Better integration between application-level and database-level alerting could have also helped close the gap here.
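A minimal sketch of the transaction/TRY-CATCH idea from the list above, with hypothetical object names and boundary date. The validation check up front fails loudly if the next filegroup has no data file, which is exactly the condition that went unnoticed in this case:

```sql
-- Sketch only; all object names and the boundary date are hypothetical.
BEGIN TRY
    BEGIN TRANSACTION;

    -- Validate before touching anything: does the next filegroup have a data file?
    IF NOT EXISTS (
        SELECT 1
        FROM sys.filegroups AS fg
        JOIN sys.database_files AS df
          ON df.data_space_id = fg.data_space_id
        WHERE fg.name = N'FG_NextMonth'
    )
        RAISERROR(N'Partition roll aborted: filegroup FG_NextMonth has no data file.', 16, 1);

    -- Roll the partition window forward
    ALTER PARTITION SCHEME PS_Reporting NEXT USED [FG_NextMonth];
    ALTER PARTITION FUNCTION PF_Reporting() SPLIT RANGE (N'2025-07-01');

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- re-raise so the Agent job step fails and alerts fire
END CATCH;
```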

The client is in good shape now, and the lessons learned have been shared. The client teams and our teams worked well together to solve this, and data was flowing again in short order.

The Straight Path Team and Skills

I worked with Andy Kelly and Jordan Boich here. We relied heavily on our shared troubleshooting skills that we’ve acquired over our years working with SQL Server. Andy provides an additional layer of expertise on our team for all things partitioning, while Jordan and I focus on the “HA/DR” team, specializing in clustering and availability groups.

This post is part of our Case of the Week series — real SQL Server issues and lessons from the field. If you’re dealing with your own SQL Server mystery, reach out — we love solving these.

Article by Mike Lynn
Mike got his start in computers in college after taking a class about Excel for an accounting major. After that class, he started taking more computer science courses and decided to change majors. After graduation, Mike took a job in Little Rock, AR as a Developer / DBA, working with a .NET 1.1 application, a SQL Server 2000 backend, and Microsoft Access/SSRS as the reporting tools. Mike quickly learned how much he enjoyed working with databases and has never looked back. The parts he enjoys most are helping people solve their pain points with data, whether that is helping with a performance problem or designing a new system to solve a particular need. He also enjoys automating work because it gives the person who was doing the work more time to focus on new business problems. Mike has worked with every major version of SQL Server since 2000, with the majority of his time spent on the 2008 R2, 2014, and 2016 releases.
