SQL Server Corruption is a rather complex topic. As a consultant, I’ve been called in many times to help a client suffering from SQL Server corruption in some form or another. As a forum post answerer I’ve helped some pretty sad situations and tried to help. There have been times when I’ve reached out to the guru of SQL Server corruption himself for advice and tips, Paul Randal. This really isn’t a “back to basics” topic. Except it should be. An ounce of prevention is worth a pound of cure. Or, in this case, some proactive footing is worth your job, your company’s data or your client’s relationships with your firm. It’s that simple.
This post is part of the Straight Up SQL Server Tips series.
SQL Server Corruption – Two Choices
This really is pretty binary. You have two main choices when it comes to a corrupt SQL Server database. Sure there are exceptions, miracles, and expensive solutions. But to keep your budget intact, data initact and users happy you either can “Get corruption and try and fix it through a myriad of hacks, tools, advice, help, lucky breaks” or “Work to prevent the circumstances that can lead to corruption and have a plan in place to recover with the most trusted method – restores from trusted backups”. There is a world of consulting services and tools between those two options. But it can get expensive, there can be lost data, and guarantees fall out the window. Straight Path does a LOT of work in the eDiscovery space for the various SQL Server based vendors. In particular we do a lot of work with kCura’s Relativity product. Like any software vendor, kCura has a multitude of customers of various sizes. Some with DBAs some without. And like any other vendor, with enough customers sometimes a customer will end up with corruption. We get called in periodically to work with clients to help analyze, review and recovery from these situations. Sometimes we can help. Sometimes we could help but we can’t guarantee what data will be missing or what state it will be in. In the world of eDiscovery? That’t as good as “all the data is lost” sometimes depending on which side of the trial the data is for.
So. Always. Always. ALWAYS choose to be proactive and prepared. Don’t wait for corruption to catch you! When we do our SQL Server health assessments, seeing the findings that together mean you aren’t prepared for corruption is a huge red flag. Partially it is because as a consultant, I end up seeing corruption a lot – and it is always “after the fact” and usually from clients who either chose or, more likely didn’t realize they were choosing, the option with less preparation.
So this post won’t really talk about recovering from corruption. It will focus on prevention and preparedness. A follow on post will talk about some initial steps to do if you get a report of corruption.
Preventing SQL Server Corruption
This is a smaller category. 98% of the time or more, corruption is not SQL Server’s fault. You, as a DBA, likely didn’t cause corruption. It used to be that the filter drivers from tools like antivirus software could have posed a risk here, and in versions of SQL Server back in the 6.5 days there were ways to cause pain here. But really since SQL Server 2000 or maybe even 7.0 these were addressed – though pay attention to your SQL Server version, there was a bug causing corruption I spoke about in last week’s tip. Corruption is usually and expression of a storage system not keeping up. It can be other causes – but there is likely something not configured right in storage, maybe power failures on an improperly configured SAN, maybe poor cache setup, etc. It can be seen with storage subsystems not tested to handle the load expected and failing to keep up. So having best practices with IO and testing for performance and making sure storage is safe is key here.
Preparing for Corruption
This is the bigger category. One of my favorite blog posts to write was from many years ago. It was a two part series with the premise “if you haven’t planned to fail. You’re planning to fail” What I mean is – be ready for what could come. Be ready for failure and have a plan to get out from under it, and you’ll probably not end up failing. If you still fail while doing all the proactive steps to recover? At least it is a graceful failure and you know you did all you could do to get out from under it but the circumstances still conspired against you. This list isn’t exhaustive but it is a good start and I wish more clients at least did these basic steps:
Have a Solid Backup Strategy and Run Database Integrity Checks
If you go to any forum and say “Help!!! My SQL Server Database reported corruption!” the first answer (and best answer) you’ll get is “Restore to a known good point”. That’s a good answer. But it only helps you if you have a solid backup and recovery plan, if you look for corruption enough, and if you take the right steps when the SQL Server corruption error messages start popping up.
What is a solid “corruption resistant” backup strategy? Well it involves making sure you schedule (and review!) database integrity checks, which we’ll get to in a minute. And it involves taking transaction log backups with database that are in full recovery model. And it involves keeping those transaction log backups between database integrity checks.
Here’s why. Corruption can be discovered in one of two ways (more or less, there are actually worse ways to discover it also). When a corruption happens in a database, at least one page of data isn’t written to disk as SQL Server sent it down. The “contract” between SQL server and the storage subsystem was broken. Sometimes you’ll know right away because it is a crucial page. Sometimes you won’t know unless that page of data is accessed. And that can happen in various scenarios. Two we’ll focus on here – a user runs a query that touches that page of data is one. So you could have a large table which contains documents. A page could be corrupt. And users may never interact with that documents page for two or three weeks. Well it could become corrupt today, and you may not know about it for 2-3 weeks when a user finally tries to load that page into memory and they get a nasty error and call you. The other time is when you perform a Database Integrity Check – DBCC CHECKDB – when you run this, depending on your options, you are looking at every page of data and inspecting it, looking for corruption. (You can also check pages when doing a backup with checksum, but not all corruptions can be detected this way and restore verify).
So the right strategy involves you doing database integrity checks as frequently as you can. And reviewing the status of those. And keeping enough backup files to recover in the event of corruption. Makes sense right? You want to know about corruption, and you can’t trust the users to find it for you – and you don’t want to wait for them to.
So we agree? DBCC CHECKDB is helpful here. Now there are exceptions and gotchas with larger DBs, but I’m speaking to the 80% or more here.
Log Backups Might Save You in a SQL Server Corruption
If we combine running CHECKDB regularly with a backup strategy that includes log backups, we begin to potentially have a restore path. It’s maybe a bit clunky (though scriptable) but it will get you back from the vast majority of corruptions I’ve ever encountered. There’s a secret – well it’s not really a secret… The SQL Server transaction log keeps the transactions you are running on SQL Server. Lots can be said there – heck Paul Randal has a several hour Pluralsite course just on logging! – but when you do an update, insert, delete, alter, etc it’s generally going to be logged – written to the transaction log file. And in FULL recovery model – SQL server is going to keep that log record until you back it up – and then you’ll have it in the transaction log backup file. This is why you can recover to a point in time – you replay the transactions as they happened and stop where you want. Corruptions are normally (there are log file corruption situations but the are different) affecting the data file. The log file doesn’t contain the corruption, it just contains the action that you did but didn’t make it to the data file correctly. And you can restore log files even between full backups – they don’t reset the log chain – that’s why full backups only on a database in full recovery model will never mark the log file for truncation and folks get in trouble with log files growing out of control.
This “secret” can come in pretty handy. I’ll explain with a scenario:
Let’s say you take a full backup each night at Midnight.
You take SQL Server Transaction Log backups each hour.
You do integrity checks every Sunday before the Sunday night backup, and run the backup if the CheckDB was successful.
Let’s now say that your Sunday CHECKDB and Backup were successful.
Each day users were using the system, each backup succeeded all week. But on Thursday at 2PM a user says “I just got this error and my connection was terminated” you look at the error and you get that feeling in your stomach. Corruption!!!
Don’t worry. I mean sure worry. As long as you were taking regular log backups AND KEEPING THEM ALL WEEK between SUCCESSFUL checkdbs – you have a likely success path.
You don’t know for sure when that corruption occurred. You have choices. The first thing you should do is make sure no one is deleting log backup files or full backup files. Stop any cleanup jobs, make copies of those backups, you don’t want to lose them. Now this could be a corruption that is only on a nonclustered index and you can rebuild that and be good, but let’s assume it’s a big nasty one where data pages are zeroed out and there is no way to track them down.
Well now you can either just get everyone out of the DB. Take a tail of the log backup. Restore the Sunday full backup with NORECOVERY. Restore each transaction log backup between Sunday’s full and the tail of the log backup you created – and Poof! You have a database back to the point the user noticed corruption and you kicked users out. No data is lost. Just some downtime experienced while you restore. You can also try and restore full backups throughout the week and see if you can get a later full backup without corruption and restore log backups after that. The point is – we saved our backup files between known good checkdbs, we looked at our checkdb output and we sacrificed some space to keep our backup files – even our log backup files – a little longer just in case.
Remember though: a backup is only good if you can restore it when you need it. You need to test your restores. You need to run backups with checksum and verify options. Those options alone don’t mean you don’t need to test restore though. Get in the habit of testing restores. This will help your muscle memory for when you need to restore, and it will help show you if you have a problem with backups. The time to find out your backups have all been failing is not when you find out you have corruption.
Compare that with the sad state some clients have no choice but to go down with a corruption. You’ll thank yourself for doing this.
There are some other things you can do ahead of time:
- Setup SQL Server alerts. I blog about that here – but set up alerts, and subscribe to them – and look for some of the errors that are warnings before corruption about storage not behaving right. Errors 823, 824, 825. Be proactive in looking for challenges and issues before they hit you.
- Look to Availability Groups or Mirrors – some corruptions can be automatically fixed by what’s called automatic page repair – if you think about these technologies work by sharing log file data – so the subscribing replica or mirror may not have corruption.
- Keep your Windows servers up to date. Keep your SAN firmware up to date. Keep your power supplies happy. Or as I just discovered when a client reached out to me, keep your air conditioner units in the data center healthy.
- Setup alerting on your windows side and physical side.