People are always saying things about the importance of data, like how it is the lifeblood of a business. It sounds cliché, but there’s a lot of truth to that. You and I can’t survive without any blood, and there’s a good chance your business wouldn’t exist for long if they suddenly lost all their data.
This blog post is intended to help you make sure you and your organization are properly prepared for disasters. And I apologize; that opening paragraph even grossed me out a little. I promise I’m done talking about losing blood.
Now, let’s talk about disasters and other mayhem out of your control. Most often, this conversation revolves around “acts of nature” like earthquakes, hurricanes, ice storms, and similar events that cause power outages or equipment failure. There are lots of other things that are just as likely to happen, like server failures, data corruption from storage issues, ransomware, and bad cumulative updates. And let’s not forget that time Pat in accounting accidentally ran a DELETE without a WHERE clause. Oops.
You can’t know when these things will happen, but if you’ve had any experience with databases, then you know they will happen eventually. And you can prepare for resolving any of these by having healthy, current backups of your databases ready to be restored at a moment’s notice. You probably know this, but…you may not be doing those backups correctly. At least not well enough to successfully recover from any of the previously mentioned disasters.
Here’s a quick checklist of the 6 most common mistakes related to backups that we see.
- Not using the right recovery model
- Not using the right tools
- Not verifying the backups correctly
- Not moving backups off-site
- Not backing up keys
- Not testing restores
1) Not using the right recovery model
There are three recovery models, and only two of them are related to point-in-time recovery. By that, I mean your data can be restored to the point in time where things went sideways, or at least very close to that moment. Only Bulk-Logged or Full recovery models allow for your database to have proper log backups, allowing for point-in-time recovery. (And by “allow,” I mean you need to take those backups to avoid the log file growing until it or anything else on the same drive can’t grow anymore, which would be a kind of avoidable disaster of your own making.)
If your databases are in the Simple recovery model, though, then you don’t have that juicy and delicious point-in-time recovery option. You only have full backups that you take every day, week, or whenever. You or one of your colleagues may have set the recovery model to Simple thinking it means “no logged transactions,” which is actually false. Transactions are still logged; they just clear after committing. What Simple recovery model effectively means is you are totally cool with losing an entire day (or more) of data. You might want to double-check with the decision-makers in your organization on that one.
And it’s not just your recovery model that can be problematic.
2) Not using the right tools
There are lots of nifty third-party applications that back up your servers with SQL Server. Unfortunately, many of them are not aware (or care) that your servers have SQL Server databases. Your systems administrators may love them, but these tools are not the backups tools your databases are looking for.
If these are just file-level backups that stroll through the drives sequentially backing up files, then they won’t work for your SQL Server databases. First, they probably won’t tell SQL Server that anything was backed up, so it won’t have any idea these backups happened. More importantly, though, if your data and log files aren’t backed up at the same moment, then when you go to restore the files, your database will likely be in suspect mode. Suspect mode is like Zombie mode: your database isn’t dead, but it’s not alive, and it can’t be queried. Fortunately, unlike actual zombies, the database can’t make other databases go into suspect mode. Unfortunately, like actual zombies, it probably isn’t coming back to the correct state anytime soon.
(I know this is a professional blog, so please don’t get distracted by my use of the phrase “actual zombies.”)
Anyhow, there are other tools that use the Volume Shadow Copy Service (VSS) for SQL Server backups. These are…OK. While they will allow you to have point-in-time recovery, they can cause unexpected performance problems by freezing all database I/O whenever a “snapshot” backup is taken using VSS. Don’t worry, the documentation says it should take…less than 60 seconds.
“The snapshot creation phase is fast (less than 60 seconds), to prevent blocking of all writes to the database.”
I don’t know about you, but even a few seconds of I/O freezing will cause other applications to freak out and fail. If your budget allows, there are alternatives that provide block-level backups at the storage level, which can work more efficiently. And again, you need those to be able to synchronously back up your database and log files, so they aren’t out of synch. Or sync. Or whatever abbreviation for “synchronization” you prefer.
As weird as it seems, for most budgets using good old native backups is the best bet. And with all the work Ola Hallengren has put into his scripts and jobs that use native backups, there are plenty of options to configure database backups for nearly all SQL Server databases.
Let’s talk about one of those configurable options now.
3) Not verifying the backups correctly
By default, all databases created in SQL Server over the last decade or so have CHECKSUM page verification enabled, which is one way to check for data corruption. Simply put, a checksum is created for every data page and is included in the header when the page is written, and then when the page is read, a new checksum is calculated. If the new checksum matches the one in the header, all is well. If not, SQL Server will flag the page as corrupted.
You can do the same thing with your backups by enabling the WITH CHECKSUM option with your backups. By reading all the data pages after writing a backup and verifying the checksum values match those at the time the backup was written, you can be reasonably sure your backup file wasn’t corrupted during writing. Yes, backups take a little longer with this option enabled, but that won’t seem nearly as long as the time you might have to spend writing a Root Cause Analysis to your boss (and their boss, and their boss’s boss) explaining why a corrupted backup file won’t restore.
And here’s another thing you might want to never have to explain.
4) Not moving backups offsite
Ransomware is a scary thing, and unfortunately, the Straight Path team has had to help a few organizations that were impacted by it. When someone with bad intentions gets into your network and encrypts everything they can find, then that encryption often includes not only the database files but also the backup files too. That is not a place you or your company’s checkbook wants to be because your only options could be to pay the ransom or close the company.
The best defense against ransomware is this: offsite backups. If you have your backups out of harm’s way, then you can go about the business of ignoring the ransom and just creating all your servers and databases and everything else from scratch. This sounds nasty, but in the era of virtual servers and all the tools available to quickly create a new server, this may not be as difficult as it sounds. It may be time-consuming, but at least you won’t have to look for a new job.
Do yourself and all your colleagues a favor and copy your backups to an offsite location. Please. We’re begging you. Set up copies in another data center, public cloud, heck, even a tape backup if that’s all you’ve got. Just make sure they are somewhere a hacker can’t access.
While you’re at it, there’s something else to back up and copy offsite.
5) Not backing up keys
Microsoft has made enabling Transparent Data Encryption (and really, any encryption) simple enough to enable. The TDE feature encrypts your storage at rest, meaning not only can someone not use copies of your data files, but they also can’t make a usable backup without a copy of the encryption keys.
Ironically, some organizations fail to make their own backups of the keys, which means if they did have a disaster, their backups would be useless. If you don’t have the key, or a certificate containing the key, then your encrypted backups are protected from hackers and everyone else, including you. It’s a feature, not a bug.
Check the dynamic management views on your SQL Server instances (start with sys.dm_database_encryption_keys) to see if your databases are encrypted, and if so, back up those keys and put them in an offsite location so those mean and smelly hackers can’t get to them.
After you’ve done that, there’s one last but very important thing to do.
6) Not testing restores
Although this is the last item on the list, it’s definitely not the least important. There’s that saying about the best-laid plans of mice and men and how they often go awry, and if you’ve resolved all the previous issues, you still could end up like a mouse without your cheese. It’s one thing to have a plan, but it’s quite another to have actually attempted to execute the plan.
Or, as former boxer Mike Tyson so eloquently said, “Everyone has a plan until they get punched in the mouth”. No, I’m not going to talk about blood again. I promised.
Lots of folks have “a plan” to recover from a disaster, but going over the processes outlined in any plan and attempting to restore databases and query their data is the way you go from thinking you can recover to knowing you can recover. It is also highly conducive to a good night’s rest.
If you take nothing else from this blog post, take this: please develop some cadence for the practice of restoring your databases. Pretend you got hit by ransomware and ask, “now what?” It’s the easiest way to find any gaps in your current processes or plans before an actual disaster. Quite frankly, it’s the way to find out if you have committed any, if not all previous mistakes. Better to find them now than when some disaster eventually occurs.