It’s 5 AM on Wednesday. Your CIO has called your cell phone at home. “We need you to log in right now! Someone or something deleted all of the inventory. I tried to help out the team right before the inventory disappeared, and I was running some SQL commands I found and a couple of procs I remembered from back when I was a little more technical. Weird timing! But anyway, we need the database back to how it looked exactly at 4:30. And users will be on soon. This is all on you.”
So it’s the third day of “Two Weeks to a Healthy SQL Server” here at Straight Path Solutions, and this topic is not security (though, clearly that will be an important topic for the DBA and CIO above). It’s Backup and Recovery. Or, more succinctly, Recovery.
On Day 2, you hopefully assessed your SQL Server environment by running an inventory, using the health check tool you prefer, and looking at the last good backups and the Recovery Models of your databases. Today, we’ll talk about those.
Backups don’t matter. Restores do.
Anyone can take a backup. That’s the “easy” part. Backups shouldn’t make you sweat or get even a little nervous (well, unless your storage performance is that bad). Restores, on the other hand, can be messed up in sundry ways:
- You can restore over the wrong database. In production. Without having a recent backup of the DB you are restoring over, because you disabled the backup job for some reason while troubleshooting something.
- You can have all of the log files you needed, except one.
- You can have a corrupt backup file.
- You can have never practiced a point-in-time restore before and be doing it for the first time today, with your manager and their manager standing in your cube staring at you.
- You need point-in-time recovery, but only take nightly fulls, and it’s the end of the business day, the problem happened a few minutes ago, and you have to explain to everyone why last night at 8 PM will have to be good enough – because it literally is all you can do.
I know these are real possibilities because I’ve been a DBA for over 22 years, and a few of those years were “trial by fire” years. I’m still here, so I came through alright, but I still don’t know how many years some of those moments shaved off of my life. Learn from my mistakes, and let’s prepare before the disaster.
Shift Your Mindset
So the first step on the road to being able to recover is a shift in mindset. Right now, stop asking questions about backing up (well, okay, you can still ask those questions, but let’s make other questions more important). Instead, focus on the restore. The whole reason we are taking backups is to restore them. That’s it. A simple shift in mindset here will help you visualize the result, start asking the right questions, and make the right decisions with that in mind.
Ask questions about the restore and the scenarios you’ll need to recover from. Make sure you have all of your files available when you need them. Do some doom-prepping in the data center, think about the various ways things will go wrong, and ask, “how can we recover from that?” (Corruption, Accidental deletion, Server goes bye-bye, Ransomware, what if something goes wrong at 1:49 AM, but then activity still happens?).
And as you start answering these questions and thinking of these scenarios – you’ll start thinking of solutions. I hope that by the end of these two weeks, the scenarios I just rattled off will scare you a bit less…
Corruption
We’ll see this on Day 4, but having a corruption-resistant backup strategy is a good start. Keep enough backups to get back to your last good CHECKDB. It’s a solid start – and, sadly, already puts you ahead of the pack. Understand tail-of-the-log backups.
Accidental Deletion
Log backups, saved and available, and the practiced muscle memory of point-in-time recovery.
Server Goes Bye-Bye
Runbooks plus backups stored someplace OTHER than the server your databases live on are a good start. A separate SAN? A separate device? Off-site in Azure Blob storage or AWS S3 for pennies per TB? (Or a proper DR setup?)
Ransomware
Off-site and protected backups, a solid AV strategy, solid anti-phishing discipline, off-site and protected runbooks, and unauthorized access restricted.
The 1:49 AM “Oops”
This one is tricky because a restore is never just a clean restore – good data comes in after the mistake. The users working after 1:49 AM probably don’t want their data lost either. Now it’s point-in-time recovery PLUS knowing your data and getting the data right in the system. Perhaps restore to a separate database with _Restore in the name and run some joined updates to fix the mistake. Good monitoring helps you notice an issue before a lot of users do a lot more on top of it, too.
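As a rough sketch of that side-by-side approach, here’s what it might look like in T-SQL. All of the database names, table names, file paths, and the timestamp are hypothetical placeholders, and this assumes a single log backup covers the window you need:

```sql
-- Restore a point-in-time copy ALONGSIDE production (never over it).
RESTORE DATABASE SalesDB_Restore
    FROM DISK = N'X:\Backups\SalesDB_Full.bak'
    WITH MOVE N'SalesDB'     TO N'X:\Data\SalesDB_Restore.mdf',
         MOVE N'SalesDB_log' TO N'X:\Logs\SalesDB_Restore.ldf',
         NORECOVERY;

RESTORE LOG SalesDB_Restore
    FROM DISK = N'X:\Backups\SalesDB_Log_0145.trn'
    WITH STOPAT = N'2024-05-01 01:48:59',  -- just before the "oops"
         RECOVERY;

-- Repair production from the restored copy without losing post-1:49 activity.
UPDATE p
SET    p.Quantity = r.Quantity
FROM   SalesDB.dbo.Inventory AS p
JOIN   SalesDB_Restore.dbo.Inventory AS r
       ON r.InventoryID = p.InventoryID
WHERE  p.Quantity <> r.Quantity;
```

The joined update is where knowing your data matters – the real fix depends entirely on what the mistake actually changed.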
The point is – think about the problems you could face, and then think about how you’d solve them. If you know how – you’re ahead of the game judging by the kind of consulting requests we get. If you don’t know how to solve one of the issues – study it, learn it, ask about it, and know how to answer it.
Recovery Models, SLAs, Expectations
What does your business expect? And if they don’t really know, what do you want to say you can deliver? At Straight Path, our default is to say we can deliver point-in-time recovery with log backups every 15 minutes. Sure, the business may say they are fine losing a business day of data in a certain system. Until they lose a business day of data in said system… So our job as DBAs is to ask the right questions and help meet the real needs. I talk a LOT more about SLAs, RPO, and RTO in this blog post from 2011 for SQL University.
But really quickly:
SLA (Service Level Agreement)
The Agreement between you and the business about the Level of Service you are agreeing to deliver.
RPO (Recovery Point Objective)
How much data can you lose?
RTO (Recovery Time Objective)
How long can you be down? This question is about more than just recovery; if the answer is < 1 second, you are out of backup-and-recovery territory and into HA/DR answers. But you need to ask the questions and get an agreement.
Check out that linked post for more.
Once you talk to the business and sort out what they need, you need to assess your ability to meet their needs. And if the business does that teenager thing and shrugs and sort of hums “I don’t know” at you, then I’d give them our “default” backup strategy (below). Look at the reports from the data collection you did last week. Can you meet their needs or the needs you want to be able to meet?
Your business may be “fine” with data from last night’s full backup. Are you? What if you have corruption and don’t run a DBCC CHECKDB each day? Most shops don’t run it daily, and on Day 4 (the corruption day), I won’t even tell you to run it daily – unless this is your backup strategy.
Elements of a Good SQL Server Backup Strategy
There are a few elements to a good backup strategy:
The Right Recovery Model
A SQL Server database has three recovery model options: Full, Simple, or Bulk-Logged. The recovery model is a configuration option set at each database. FYI: new databases inherit whatever recovery model the model database is set to. In short, the recovery model tells SQL Server how to handle the transaction log. I go into a lot more detail in this post about the Transaction Log. For this post’s purposes, let’s consider the significant difference between Simple and Full.
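You can see and change all of this with a couple of lines of T-SQL. The database name below is a placeholder:

```sql
-- Check the recovery model of every database, including model
-- (new databases inherit whatever model is set to).
SELECT name, recovery_model_desc
FROM   sys.databases;

-- Change a database's recovery model.
ALTER DATABASE YourDatabase SET RECOVERY FULL;
```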
In the Simple Recovery Model
You cannot take log backups. SQL Server only keeps the transaction log records it needs to run crash recovery. Many folks think Simple = no transaction logging. That isn’t true – it would defeat one of the basic tenets of relational databases (see the post above for more). Instead, log space is reused once the transactions in it have committed and a checkpoint has written the changes to the data files, so the records are no longer needed for crash recovery. This is a fine recovery model if you are okay going back to your last good Full or Differential backup.
In the Full Recovery Model
Not only can you take log backups – you must take log backups. In the Full Recovery Model, you are effectively telling SQL Server, “I don’t just want the transaction log for crash recovery like in Simple; keep every transaction log record until I capture it in a log backup, because I want to be able to recover to a point in time if I ever need to.”
We still often find databases in the Full Recovery Model without log backups on initial health checks. That’s understandable: the out-of-the-box default for SQL Server is the Full Recovery Model, which makes sense since point-in-time recovery is what most good production databases want. But Microsoft doesn’t make log backups happen automatically, so without them the log files just keep growing.
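A quick check for this situation, using the backup history SQL Server keeps in msdb (the one-hour threshold is just an example – set it to your own log backup cadence):

```sql
-- Flag Full/Bulk-Logged recovery databases whose last log backup
-- is old or missing: the classic "full recovery, growing log" setup.
SELECT d.name,
       d.recovery_model_desc,
       MAX(b.backup_finish_date) AS last_log_backup
FROM   sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'L'                 -- 'L' = log backup
WHERE  d.recovery_model_desc <> 'SIMPLE'
  AND  d.database_id > 4               -- skip system databases
GROUP BY d.name, d.recovery_model_desc
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(HOUR, -1, GETDATE());
```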
Put it Together
You need to take the RPO and RTO needs from your business (or yourself) and sort out which Recovery Model your database needs to be in. Then you need to figure out the backup schedule to meet the recovery objectives you’ve laid out.
Your first task is a simple “Where do we want to be? What do we have? What do we have to do to get there?” exercise. Once you have the right recovery model on your databases and know your recovery objectives, you can plan a backup strategy.
A SQL Server Backup Strategy
A good backup strategy is focused on the restore strategy, as I’ve said a few times here. There are a few elements to the backup strategy:
Recovery Model
We don’t need to go into that again – we covered it above. At Straight Path we default to the Full Recovery Model, as you’ll see below.
Backup Types
We have three to choose from:
- Full Backup – This is what it sounds like – a complete backup of an entire database.
- Differential Backup – These are all of the changes since the last Full backup (you’d need to restore the full plus the latest differential).
- Log Backup – The transaction log records since the last full or the last log backup. You need all of the log backups, in order.
At Straight Path we default to using all three types, as you’ll see below.
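Here are the three types as bare T-SQL, in the order you’d typically schedule them. Database name and paths are placeholders:

```sql
-- Weekly full: a complete backup of the entire database.
BACKUP DATABASE YourDatabase
    TO DISK = N'X:\Backups\YourDatabase_Full.bak';

-- Daily differential: everything changed since the last full.
BACKUP DATABASE YourDatabase
    TO DISK = N'X:\Backups\YourDatabase_Diff.bak'
    WITH DIFFERENTIAL;

-- Every 15 minutes: log records since the last log backup.
BACKUP LOG YourDatabase
    TO DISK = N'X:\Backups\YourDatabase_Log.trn';
```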
Backup Location
You should think carefully about where your backups go. There are choices here. The biggest thing to consider, though, is disaster proofing. If you keep all of your backup files on the same host as your SQL Server instance, you’ve just put a lot of important eggs in one basket. If you put them on the same drive as your data or log files, it’s an even more delicate basket.
There are a few strategies here:
- Backup locally first and then copy to a backup share/network location/backup the backup folder. -> It’s nice having a local backup file for an “oops” restore and the peace of having a backup file someplace else is extra redundancy in your backups for the later restores.
- Backup straight to a network share. -> This can offload the writes to the NIC instead of the storage – a potential performance enhancement. It also takes your backup files offline immediately, so if something happens to the server after a backup, you are more likely to be safe.
- Backup to the cloud directly. -> There are a few tools for this; SQL Server supports backup to a URL, for example. This goes immediately to Azure. It gets your backup into a potentially globally redundant storage solution with mirroring, and you have extra protection for your backups. They are immediately out of your data center someplace safe.
- Backup to a third-party backup tool/device. -> There are numerous options here – too many to discuss. We’re okay with these if they are truly SQL Server aware, crash-consistent, and you are comfortable with them and have tested restores of all types from the tools (especially to a point in time.)
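For the backup-to-URL option, the shape of the T-SQL looks like this. The storage account, container, and credential name are placeholders, and this assumes you’ve generated a SAS token for the container (the secret below is deliberately left as a placeholder):

```sql
-- One-time setup: a credential named after the container URL,
-- using a Shared Access Signature for that container.
CREATE CREDENTIAL [https://yourstorage.blob.core.windows.net/sqlbackups]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = '<your SAS token>';

-- The backup itself goes straight to Azure Blob storage.
BACKUP DATABASE YourDatabase
    TO URL = N'https://yourstorage.blob.core.windows.net/sqlbackups/YourDatabase_Full.bak'
    WITH COMPRESSION, CHECKSUM;
```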
At Straight Path, with traditional backups, we prefer to either back up straight to a network share or back up locally and quickly copy the backups off. We’re also fine with backing up locally and knowing the files are swept to an enterprise backup tool immediately.
With databases less than 2 TB or so, we still prefer using native SQL backups and then having those backed up. The DBAs are ultimately responsible for restores. That’s one of the reasons we like the most recent week of backup files available to them without the need to involve the systems/storage/backup teams. When the backups are immediately available only in a third-party or off-site backup tool, restoring becomes a shared responsibility with more moving parts, which means more risk. With so many great options from storage vendors and backup tools, those are solid options as well. But you need to read, understand the solution, test the restores, and maybe test again.
SQL Server Backup Options
When you take a backup, there are a couple of options that just make sense to use. At Straight Path, we use these by default ourselves:
- Compression – Backup compression is worth it, both for the performance and for the size of the file.
- Verify – This runs a RESTORE VERIFYONLY of the backup. It does not mean your backup is 100% good to go, but it at least rules out a seriously corrupt or bad backup file immediately. It’s a good first-level check that buys some confidence, but a true restore is the only way to know your backup is good.
- CHECKSUM – Like verification, this adds another bit of confidence to the process. A checksum alone doesn’t prove you are 100% good, but skipping it means you are starting with no base to build confidence on.
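Put together, a backup with those options and the first-level verification looks like this (database name and path are placeholders):

```sql
-- Backup with the options above.
BACKUP DATABASE YourDatabase
    TO DISK = N'X:\Backups\YourDatabase_Full.bak'
    WITH COMPRESSION,   -- smaller file, usually faster too
         CHECKSUM,      -- validate page checksums as pages are read
         STATS = 10;    -- progress messages every 10%

-- First-level check only; a real restore is the only true proof.
RESTORE VERIFYONLY
    FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
    WITH CHECKSUM;
```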
Next time, we’ll talk about corruption (looking for it, preparing for it, being a bit resistant to it), and that’s where the backup schedule and retention schedule come into play. This is really up to the business and its needs, but our general approach at Straight Path is to try to do at least this:
Online (On storage we can access as DBAs without having to call anyone)
We want to see a weekly full backup, a daily differential on the other days, and log backups every 15 minutes, with all of the backups (the full, every diff, every log) retained since the last good CHECKDB. This gives us the ability to recover from corruption at any point in the week to a known good point: since corruption usually lives in the data file, restoring a good full/diff and replaying every log since will usually get us back without the corruption. (There are exceptions, but even then, retaining all of these backups lets us get back to the furthest good point possible.)
“Offline” (In a backup tool/off-site/etc.)
This really varies based on retention policies. In an ideal world, though, I’d love to see an additional week of log backups stored offsite and at least an additional week of diffs and fulls. Often your business has legally enforceable retention policies here (storing monthly backups for a year, yearly backups for 7 years, etc.), so this will vary.
Finally, the only good backup is the one you can restore. If you can’t restore it, it was never a good backup. With the backup schedule and options described above, you will be ahead of the game – sadly, probably ahead of at least 30% of the businesses out there. But you are still not good until you know you can restore. To that end, there are a few things you should be testing:
Going back to the list above of how things can go bad: you do not want the first time you do a restore to be the one time you have to do it live with everyone watching. You need to test restores – and test them frequently – with the tools and scripts you prefer to use. We are huge fans of dbatools and the Restore-DbaDatabase cmdlet. I also really like sp_DatabaseRestore, which our own Tara Kizer blogged about here when she was working for Brent Ozar.
Practice your restores: full backup only; full + diff; full + all the logs; full + logs + point-in-time recovery. Document what you learn. And don’t just test yourself once – practice periodically to build muscle memory.
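The full point-in-time sequence is worth having in muscle memory, tail-of-the-log backup included. This is a sketch with placeholder names, paths, and timestamp, assuming one full, one diff, and one earlier log backup exist:

```sql
-- 1. Capture the tail of the log so nothing after the last
--    log backup is lost. NORECOVERY leaves the database in
--    a RESTORING state, ready to be restored over.
BACKUP LOG YourDatabase
    TO DISK = N'X:\Backups\YourDatabase_Tail.trn'
    WITH NORECOVERY;

-- 2. Restore the full, the latest diff, then the logs,
--    all WITH NORECOVERY so the chain stays open.
RESTORE DATABASE YourDatabase
    FROM DISK = N'X:\Backups\YourDatabase_Full.bak' WITH NORECOVERY;
RESTORE DATABASE YourDatabase
    FROM DISK = N'X:\Backups\YourDatabase_Diff.bak' WITH NORECOVERY;
RESTORE LOG YourDatabase
    FROM DISK = N'X:\Backups\YourDatabase_Log.trn' WITH NORECOVERY;

-- 3. Stop the final log restore just before the mistake,
--    then bring the database online.
RESTORE LOG YourDatabase
    FROM DISK = N'X:\Backups\YourDatabase_Tail.trn'
    WITH STOPAT = N'2024-05-01 04:30:00',
         RECOVERY;
```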
I told you in the backup options section above that the verification and checksum options aren’t enough to know your restores are good. You have to do a restore to find out if your backups were good. With a week of 15-minute log backups and the Sunday full – something like 673 backup files – theoretically any one of them could have an issue in a worst-case scenario, which is an argument for more frequent CHECKDBs if you can manage it. You don’t have time to test every single backup file.
But you can at least start sampling them. You can test yourself and your backups simultaneously by running the tests described above against different backup files for different databases. Inside the dbatools PowerShell module, a command called Test-DbaLastBackup does what it sounds like – it can test the last set of backups for the server. You could create a job that tests restores with these tools or your favorite script to confirm they work. And in the process of building those jobs, you’ll learn about any flaws in your backup routine.
There are many options here. You could use maintenance plans. You could roll your own maintenance. MinionWare looks really promising. Here at Straight Path, we prefer Ola Hallengren’s scripts for backups, and it seems like we are far from alone there. In the homework, we link again to his scripts. Go check them out and really consider using them.
Once the classes are all out, I’m probably going to mash them together and create a few deliverables. One of them will be our typical Ola Hallengren maintenance approach: the options we like, the schedules we try to default to, etc.
Homework
- If you haven’t started looking at the Ola Hallengren maintenance scripts as the Day 2 homework suggested, now is a good time.
- Look at your database recovery models, your backup schedules, and your backup options from the inventory and assessment you did last week.
- Talk to the business, understand their RPO/RTO, and agree on what you should do. Look at the output in the step above – does it match? If not, what do you have to do to get compliant? (Recovery models? Backup schedules? Retention? Backup types?)
- Test some restores (PLEASE BE SO CAREFUL HERE!!! Do not test a restore over a production database – restore to a database with a new name, ideally on a different server.) Test point-in-time recovery, test fulls, and test diffs.
- Consider scripted ways to sample restores so you know your backups will be good the times you actually need them.
- Document your strategy.
- Start creating “runbooks” of your environment – using the inventory from Day 2 as your starting point.