Interesting Issue Today/Yesterday –
A query was using a linked server to grab data from DB2 via an ODBC connection. Normal execution time is about 15 minutes. This one had run for at least 7 hours when we decided to kill it; it appeared to be doing nothing but causing blocking (it was holding Intent Shared and Intent Exclusive locks in preparation for an update if the data ever came back).
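If you hit something similar, a query along these lines will show who is blocking whom and which locks the stuck session is holding; the session ID 53 used below is just a placeholder for the offending SPID.

    -- Sessions that are currently blocked and the session blocking them
    SELECT session_id, blocking_session_id, status, command, wait_type, wait_time
    FROM sys.dm_exec_requests
    WHERE blocking_session_id <> 0;

    -- Locks held by the suspect session; request_mode will show the IS/IX locks
    SELECT resource_type, request_mode, request_status,
           DB_NAME(resource_database_id) AS database_name
    FROM sys.dm_tran_locks
    WHERE request_session_id = 53;  -- placeholder SPID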
We waited a few hours and it still hadn’t died, so we figured we could keep waiting; if it had run for 7 hours before the KILL, the rollback could potentially take at least that long, if not much longer.
Came back today and it was still running. By this point it was about 30 hours since the initial query executed and 23 hours after the KILL was issued. The process state was ROLLBACK.
When I looked at the stats for the session (SELECT * FROM sys.dm_exec_requests), I noticed the CPU counter was incrementing, so it was at least thinking about doing something, even if it wasn’t visibly making progress. A rollback that long just didn’t make sense given what the query did.
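For reference, this is roughly what I was watching; 53 is again a placeholder SPID. KILL ... WITH STATUSONLY asks SQL Server for its own rollback progress estimate, though in hung rollbacks like this one it isn’t always informative.

    -- The rolling-back session; cpu_time ticking upward means it is still doing work
    SELECT session_id, status, command, cpu_time, percent_complete, wait_type
    FROM sys.dm_exec_requests
    WHERE session_id = 53;  -- placeholder SPID

    -- SQL Server's estimate of how far along the rollback is
    KILL 53 WITH STATUSONLY;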
That left me with few options, though:
- Restart the SQL Server service – risk this database coming back up stuck in recovery and not being available at all (there is a quick way to watch for that, sketched just after this list)
- Something horrible and drastic – not a great troubleshooting methodology, so I put this out of my mind as a serious option… Rebuilding the log would probably have “solved” the immediate issue while potentially leaving the database in a transactionally inconsistent state
- Wait and hope – safest, but given the timelines of the processing that had to happen, not a truly viable option.
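On the restart option, at least the recovery risk can be watched. Something like the following (the database name is made up for illustration) shows whether the database came back ONLINE or is sitting in RECOVERING, and pulls the recovery progress messages from the error log.

    -- Database state after the restart: ONLINE, RECOVERING, RECOVERY_PENDING, SUSPECT...
    SELECT name, state_desc
    FROM sys.databases
    WHERE name = N'SalesDW';  -- placeholder database name

    -- Recovery progress messages in the current error log (undocumented but widely used proc)
    EXEC sys.xp_readerrorlog 0, 1, N'Recovery of database';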
So I called MS Support. Great assistance (as usual from SQL support), but it confirmed what I was thinking: no help. They had no tools to undo this, and at first they offered the same three options I had above. Then they came back with a 4th I hadn’t considered:
- Take a full backup, restore it under a new name, verify everything looks right, and cut over to that database (rename the database with the stuck rollback out of the way, then rename the restored database to take its place).
The escalation engineer had a good point: when you take the full backup, the transaction that is rolling back was never committed and would not be included in that backup, so the restore will produce a consistent and ready database.
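A rough sketch of what that cutover would look like; the database name, logical file names, and paths below are all made up for illustration, and the final rename of the database with the stuck rollback may itself be blocked until that session finally clears.

    -- 1. Full backup of the affected database (the stuck transaction was never committed,
    --    so it will not survive the restore)
    BACKUP DATABASE SalesDW
    TO DISK = N'X:\Backups\SalesDW_cutover.bak'
    WITH COPY_ONLY, CHECKSUM;

    -- 2. Restore under a new name, moving the files so they do not collide with the originals
    RESTORE DATABASE SalesDW_Restored
    FROM DISK = N'X:\Backups\SalesDW_cutover.bak'
    WITH MOVE N'SalesDW_Data' TO N'X:\Data\SalesDW_Restored.mdf',
         MOVE N'SalesDW_Log'  TO N'X:\Logs\SalesDW_Restored.ldf',
         CHECKSUM;

    -- 3. After verifying the restored copy, swap the names so applications see the new database
    ALTER DATABASE SalesDW MODIFY NAME = SalesDW_StuckRollback;
    ALTER DATABASE SalesDW_Restored MODIFY NAME = SalesDW;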
Before going that route, I searched some more on the issue. After looking at what the query was doing (again, not a lot of work) and at common reports of the same problem (with linked servers, some folks reported hung queries that, after a KILL, sat in rollback for as long as 7 days before they gave up and restarted), I decided to combine two of the options to mitigate the risk. Feeling confident that a restart would probably be the end of it, I:
- Backed up the offending database so I could fall back to the 4th option from MSFT PSS if need be
- Went through the change notification process for the service recycle
- Restarted SQL Service
- Held my breath and
- Presto! The database recovered instantly in a good state
- Ran DBCC CHECKDB to ensure the database was free of any issues it would pick up (sketched after this list)
- Had users query data to ensure it seemed right
- Back in business
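For completeness, the consistency check in the list above was nothing more exotic than this; the database name is again a placeholder.

    -- Consistency check on the recovered database; NO_INFOMSGS limits output to actual problems
    DBCC CHECKDB (N'SalesDW') WITH NO_INFOMSGS, ALL_ERRORMSGS;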
In keeping with the theme of mitigation, I will be digging more into the why: investigating the ODBC drivers and versions in use here, attempting to recreate the same issue in a different environment, and remembering this blog post is here if I need it…