Quick Summary
We received an on-call request from a client who had noticed something running slowly over the weekend and decided to issue a KILL command. The KILL command did what it does – it started the KILLED/ROLLBACK process. They had done that once or twice before, they thought, and didn’t remember it taking long. They wanted to know why the session was stuck in that state, how they could tell it to hurry up, or whether we could stop it for them.
Context
This was one of their stored procedures, running on a standalone SQL Server 2019 instance hosting multi-terabyte databases with heavy OLTP traffic (less busy on the weekends, but still active).
This job usually runs in 2-4 hours, and when it runs longer, their typical internal process is apparently to kill the SPID and then restart the job.
The Problem
After performing standard troubleshooting steps, reviewing the data collected during the job run, and verifying that no new records were being scanned, the client killed the SPID that appeared to be hung. This time, though, the transaction did not roll back quickly the way it does 99% of the time.
The client left the transaction alone in KILLED/ROLLBACK for 4 hours before calling us, since it was the weekend. They explained that this system is critical, that getting another attempt at the job was essential, and that it was causing performance issues. They weren’t yet at the end of the window where panic had to start, but it was getting close to time to do something.
The Investigation
I was the on-call DBA for the weekend, and once the call came in, I reached out to the client and began connecting. On the SQL Server side, we reviewed:
- sp_WhoIsActive – we wanted to see if any blocking SPIDs or locks were preventing the SPID from rolling back. There are tons of extra options in this tool; see sp_WhoIsActive: 5 Fast Fixes for Common SQL Problems for a quick glimpse. We used the blocking options Jeff points out there, and we also looked at what was going on with the SPID – how much work was being done, and so on.
- SQL Agent job and job history – we checked the job’s typical runtime before it was killed and viewed the actual steps that had been running (and that we were now trying to roll back) to get an idea of how far along it might have been.
- Transaction log file size/usage – when rolling back a transaction, you keep consuming log space: the open transaction keeps that portion of the log active, and log backups cannot free the space held by the original transaction or its rollback. If we hit the ceiling on space and the log fills completely, the rollback cannot continue and we are stuck in a holding pattern, or worse. We browsed this in SSMS and also used DBCC SQLPERF(LOGSPACE) to review it – we really don’t want to end up in a situation where the log fills, grows, and then consumes all the disk, leaving us scrambling to create new log files. (A rough sketch of these checks follows this list.)
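For reference, here is a minimal sketch of the kinds of checks we mean. It assumes Adam Machanic’s sp_WhoIsActive is installed on the instance, and the parameter choices shown are just the blocking-focused options, not the exact command from the call.

```sql
/* Who is blocking whom, and what locks are held? (requires sp_WhoIsActive installed) */
EXEC sp_WhoIsActive
     @find_block_leaders = 1,                       -- roll blocked-session counts up to the lead blocker
     @get_locks = 1,                                 -- include the locks each session is holding
     @sort_order = '[blocked_session_count] DESC';   -- worst blockers first

/* Transaction log usage per database - watch "Log Space Used (%)" while the rollback churns */
DBCC SQLPERF (LOGSPACE);
```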
We looked at the job code, and the client knew exactly what the job was doing (always a big help). When a transaction goes into ROLLBACK, the rollback is single-threaded, so if the original transaction ran in parallel, rolling it back will typically take longer than the original execution did up to the point the SPID was killed. The “rules” around KILL are “iffy” and sometimes strange, but in general you should expect the rollback to take at least as long as the original transaction, and often longer, precisely because the rollback cannot use the parallelism the original work did.
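SQL Server will at least give you a rough progress estimate while a rollback runs. A minimal sketch, with 85 standing in for the killed session id rather than the client’s actual SPID:

```sql
/* Rough rollback progress from the DMVs; session id 85 is a placeholder */
SELECT session_id,
       command,                    -- shows KILLED/ROLLBACK for a session being rolled back
       percent_complete,           -- populated for ROLLBACK (among other commands)
       estimated_completion_time,  -- estimated milliseconds remaining - treat it as a guess
       total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id = 85;
```

The estimate bounces around, but a number that keeps climbing at least tells you the rollback is making progress rather than sitting behind something.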
The main things we could provide here were information on how SQL Server rolls back transactions, help estimating how much time we might have left, a gentle reminder about what to do next time, and a talk through the only two options we had:
- Let the rollback finish – the log had the space, the rollback was working through it, and activity in other databases and tables was progressing – so they were up and alive – they just needed the rollback to finish so they could figure the job out and make it work.
- Restart the SQL Server instance – recovery at startup can run in parallel under certain conditions and versions, and that can make the rollback faster – but that database is then offline, with redo/undo running until it is done. (See the sketch after this list for one way to watch recovery progress from the error log.)
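If the restart route had been taken, about the only window into progress is the SQL Server error log, where crash recovery writes percent-complete messages. A minimal sketch using xp_readerrorlog; the database name in the sample message is made up:

```sql
/* Crash recovery writes messages like:
   "Recovery of database 'TheirBigDB' (7) is 42% complete (approximately 1800 seconds remain). Phase 2 of 3."
   Parameters: 0 = current log, 1 = SQL Server error log, then a search string. */
EXEC xp_readerrorlog 0, 1, N'Recovery of database';
```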
The Fix
The only real “fix” for a transaction in ROLLBACK (short of restoring to a point before the pain began and losing data elsewhere) is to sit back, grab some popcorn or coffee, and wait – at least if the database needs to stay online for other transactions to run against it.
If the database can be offline for X hours, then a restart of SQL MIGHT be a bit faster, but only if you can deal with the whole database being offline during crash recovery. In this case, we ended up waiting – the log was OK and the client could wait for it, so we settled in for some “watchful waiting.”
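In practice, “watchful waiting” mostly meant periodically asking the engine how far along the rollback was. One way to do that (again, session id 85 is just a placeholder):

```sql
/* Re-issuing KILL against a session that is already rolling back does not kill anything again;
   WITH STATUSONLY only reports the engine's progress estimate for the rollback. */
KILL 85 WITH STATUSONLY;
-- Returns something like:
-- "SPID 85: transaction rollback in progress. Estimated rollback completion: 63%. Estimated time remaining: 1712 seconds."
```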
About 3 hours after the call, while we and the client were periodically checking in on log use and performance (now 9-10 hours into the KILLED/ROLLBACK), I got another call from the client: the transaction had rolled back successfully! Wahoo! There were no further interruptions to other transactions on the database, and the rollback finished without any intervention.
Lesson/Takeaway
While there is no guarantee we could have fixed this before the KILL command was issued, we would have been able to explain the worst-case scenario and most likely have found what was holding up the job in the first place and adjusted it. And since this is a job that is “tricky and problematic” for them at times, it is begging for our experienced tuners and team to review and optimize. But at the very least, had we gotten the call before it was stuck in KILLED/ROLLBACK, we could have done:
- An additional review of the “hung” transaction using sp_WhoIsActive and its options to pinpoint blocking/locking. We could have done some wait stats analysis to see whether we were in a bad loop in the code and “stuck,” or simply competing for resources.
- A look at how long the job had been running. We could have queried around to figure out where in the proc it was and how much work it seemed to have done (querying with NOLOCK hints, looking at status columns or logging, etc.), then looked at the duration, CPU, and writes for the session and determined what made the most sense: wait for it to finish, or incur the pain and uneasy feeling of issuing a KILL. (A sketch of that kind of triage query follows this list.)
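To make that concrete, here is a minimal sketch of the pre-KILL triage we have in mind, with 85 once more standing in for the job’s session id:

```sql
/* How much work has the session done so far, and what is it waiting on right now? */
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,            -- current wait, if any
       r.wait_time,            -- how long it has been in that wait (ms)
       r.cpu_time,             -- cumulative CPU for the request (ms)
       r.total_elapsed_time,   -- how long the request has been running (ms)
       r.reads,
       r.writes,               -- a rough feel for how much work has been done
       r.logical_reads
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 85;

/* How much log has the open transaction generated? A rough proxy for how much there would be to roll back. */
SELECT st.session_id,
       dt.database_id,
       dt.database_transaction_log_bytes_used
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_database_transactions AS dt
  ON dt.transaction_id = st.transaction_id
WHERE st.session_id = 85;
```

A session with heavy writes and a big pile of log behind it is exactly the kind you want to think twice about killing.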
Everything has been running as it should be since this scare. We still see the job running long (3-4 hours sometimes), and we’ve offered to help review it and prevent the next one. We also gave them a few tools to use before killing the next time.
Hopefully, if this gets stuck again, they will call us before trying to kill the SPID, or at least use the extra options in sp_WhoIsActive that we showed them on the call to troubleshoot first.
The Straight Path Team and Skills
Mike Lynn was around that weekend, put a couple of our core values to work (generosity and an ownership mentality), and gave up his own weekend and family time to help me out.
We educated the client and provided as much information on the what-ifs as possible. Mike even went out of his way to log in to the customer environment before leaving for a family event, and found the transaction had completed its rollback almost exactly when the customer called to say, “IT COMPLETED!!!”
