
SQL Server Backups and the Illusion of Safety: Confusion

With the recent data center fire and missing government data in South Korea, it felt like a good time to continue the SQL Server Regrets series of blog posts. We’ll talk about the confusion of thinking you are covered for recoverability – when you just aren’t. This “Regret” surfaces far too often when a new client arrives with ongoing backup issues.

Note: It may be evident to a select few that the “Regret” series was inspired by listening to too much Joy Division, New Order, and others like them. Each post ties to a song from NO/JD – some have more concrete ties than others. For “Confusion,” there’s not much lyrical depth to this 1983 single that wound up on Substance, but the slightly chaotic tune and the oft-repeated “confusion” fit the situation we walk into when we hear from a new client who was under the false, confused assumption that their SQL Server backups were fine…

This is the kind of call the Regret posts are trying to prevent, and it’s just like the ones we get a few times a year:

We received an error message about corruption and session was killed or something, and our database is suspect and nothing works now. I don’t understand what this means. Can you help me? I tried to restart and ran some checkdb repair allow data loss thing, but it still seems off. I think the problem is bigger—the backups we thought we had aren’t really what we had, and I don’t think we can restore. Can you get our data back?

“You just can’t believe me when I show you what you cannot see,” is perhaps the part of the lyrics that speaks loudest here. So many folks are in trouble – and they don’t know it. They are in trouble right now. You could be in trouble right now – and not even know it until you are the one in the news. Let’s fix that – you can do it yourself. I want to focus on two “confusing” phenomena here – “Rogue SQL Server Backups” and “Who is in charge?!” Then I’ll show you some tools you can use to make sure you aren’t in confusion. You don’t want to ever hear your users start saying, “You cause me confusion, you told me you cared….”

Rogue SQL Server Backups

When Jeff Iannucci on our team started building his sp_Check* tools, sp_CheckBackup was the one I was looking forward to the most – because a very common situation we find with new clients is “rogue backups.”

Maybe the VM team runs snapshots in Veeam or some other tool, the DBA has scheduled native SQL backups with Maintenance Plans, and the cloud provider has silently added their own “SQL Aware” snapshot that isn’t fully SQL aware after all. Three tools – three schedules – one broken chain that can mean no recovery – or at least delays and confusion in recovery.

When too many tools touch the same database, you end up with backups overwriting each other, log chains breaking, and restores that just don’t work the way you expected. The dashboard still says “success.” The job history looks green. But you are sitting there trying to match which backup was taken when, where the differential or log chain is – and if the log backups even restore.
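If you want to see this for yourself before reaching for any tool, msdb records every backup taken against the instance, no matter which product took it. Here’s a minimal sketch – the 14-day window and column choices are just illustrative – that lists recent backups, where each one went, and the login that took it; snapshot-style backups from VM or SAN tools typically show is_snapshot = 1 and a GUID-like virtual device name instead of a file path:

-- Recent backups recorded in msdb, regardless of which tool took them.
SELECT bs.database_name,
       bs.backup_start_date,
       bs.type,                      -- D = full, I = differential, L = log
       bs.is_snapshot,               -- 1 usually means a VM/SAN snapshot tool
       bs.user_name,                 -- the login that ran the backup
       bmf.physical_device_name      -- file path or virtual device name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.backup_start_date >= DATEADD(DAY, -14, GETDATE())
ORDER BY bs.database_name, bs.backup_start_date;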

If that mess of overlapping backups sounds familiar, you’re already in confusion. If it doesn’t sound familiar, but you’ve been told your SQL Server backups are good because of VMware backups or snapshots, or because some other team “helps” manage them? You may well be in confusion too.

That’s why we built sp_CheckBackup, a free community tool that reads your SQL Server’s own backup history, checks your full and log backup chains, and shows you the truth. It doesn’t care what vendor you use or which job “owns” it. It just tells you if your data is restorable.
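To make “broken chain” concrete: in msdb, each log backup’s first_lsn should match the previous log backup’s last_lsn for the same database. Here’s a rough sketch of that continuity check – illustrative only, not sp_CheckBackup’s actual code:

-- Find gaps in the log backup chain: a log backup whose first_lsn does not
-- pick up exactly where the prior log backup's last_lsn left off.
WITH LogChain AS (
    SELECT database_name,
           backup_start_date,
           first_lsn,
           last_lsn,
           LAG(last_lsn) OVER (PARTITION BY database_name
                               ORDER BY backup_start_date) AS prev_last_lsn
    FROM msdb.dbo.backupset
    WHERE type = 'L'                 -- log backups only
)
SELECT database_name, backup_start_date, prev_last_lsn, first_lsn
FROM LogChain
WHERE prev_last_lsn IS NOT NULL
  AND first_lsn <> prev_last_lsn     -- the chain is broken here
ORDER BY database_name, backup_start_date;

Point-in-time recovery stops cold at the first gap that query finds – everything after it depends on a log backup that no longer lines up.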

No sign-up. No form. Just a quick download, a query window, and the chance to end the confusion. Just do it – check your backups, understand all of the “folks” helping you take backups, and make sure you can restore when you need to. It’s that easy.

When “Everyone” is in Charge – No One Is…

Here’s the part we see a lot of teams miss. When everyone is responsible, no one is accountable. “I thought the VM team was handling the backups….” is a thing we’ve heard. I sadly even once heard a new client with corruption say, “But we pay for support and [software vendor] connects remotely for support – I thought they were doing our backups?!” (Their vendor recommended us – and their vendor never said they did backups. In fact, they have documentation telling clients to handle their own, thanks to our helping them write some best-practice guides.)

The VM team assumes the DBAs have it covered. The DBAs assume the storage team does. The cloud team assumes “it’s built in.” And management assumes restores will always deliver point-in-time recovery with zero data loss and zero downtime (all while forcing a shoestring budget and not asking the right questions… am I right?)

On the strength of those assumptions, everyone believes the backups are just happening.

Until they don’t….

That’s why we test. In our DBA as a Service work, we don’t stop at “the job succeeded.” We suggest folks restore, verify, document, and put a name on it. Someone always owns it. Someone always knows it worked. We work HARD to extinguish rogue backups – because rogue backups often mean no one is in charge – and underneath a hodgepodge of extra backups and confusing lineage between backup files, there is usually a whole mess of other confusion that is either causing chaos right now or soon will be.
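If you want a starting point for the “verify” part, here’s a minimal sketch with hypothetical database and path names: take backups WITH CHECKSUM so page checksums are validated as the backup reads them, then confirm the file is complete and readable with RESTORE VERIFYONLY. Neither replaces an actual restore test, but both catch problems early:

-- Hypothetical names and paths, purely for illustration.
BACKUP DATABASE SalesDB
    TO DISK = N'X:\Backups\SalesDB_Full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;

-- Confirms the backup file is readable and its checksums are valid;
-- it does NOT prove the database will restore and recover cleanly.
RESTORE VERIFYONLY
    FROM DISK = N'X:\Backups\SalesDB_Full.bak'
    WITH CHECKSUM;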

You’ll Never Know Unless You Drill It

A backup isn’t a backup until you’ve restored it.

You can run jobs, take snapshots, or automate tasks all day long, but until you actually restore one of those databases, you don’t really know anything – and it’s during the restore that all the points of confusion surface. So here’s a simple assignment:

Run a restore test this week. Pick one database, restore it to a different server, choose a point-in-time recovery option, and see how long it takes. See that you can do it. See that all the files you expect are where you expect them, and that you don’t have some other system “helping you out” by taking its own backups. You might be surprised what you find.
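To picture the moving parts, here’s a sketch of such a point-in-time restore test. The database, paths, logical file names, and STOPAT value are all hypothetical – get your real logical names from RESTORE FILELISTONLY first:

-- Restore the full backup without recovering, so log backups can be applied.
RESTORE DATABASE SalesDB_RestoreTest
    FROM DISK = N'X:\Backups\SalesDB_Full.bak'
    WITH NORECOVERY,
         MOVE N'SalesDB'     TO N'D:\Data\SalesDB_RestoreTest.mdf',
         MOVE N'SalesDB_log' TO N'L:\Logs\SalesDB_RestoreTest.ldf';

-- Roll forward through the log and stop at the chosen moment in time.
RESTORE LOG SalesDB_RestoreTest
    FROM DISK = N'X:\Backups\SalesDB_Log_01.trn'
    WITH STOPAT = N'2025-01-15T10:30:00', RECOVERY;

-- Sanity check the restored copy while you're at it.
DBCC CHECKDB (SalesDB_RestoreTest) WITH NO_INFOMSGS;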

If you don’t have time, sp_CheckBackup can at least be a fast first step. It’ll tell you which databases are missing backups, where the gaps are, and which jobs might be out of sync. It will show you where your backups are going, and while it won’t tell you “who” is taking them all, you can see whether you have only your expected backups or a few flavors of backups all happening at once. Or it will tell you that you aren’t taking any at all. The scripts are community tools – they don’t phone home, and they don’t cost anything, not even your e-mail address. Take a few minutes to download and explore today. I promise the time you save later (and the data you save if something is wrong) will be worth those few minutes.
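And as a quick first pass you can run yourself, something like this flags databases whose most recent full backup is stale or missing entirely – the seven-day threshold is arbitrary, so set it to your actual RPO:

-- Every online database and its most recent full backup, flagged when stale.
SELECT d.name,
       MAX(bs.backup_finish_date) AS last_full_backup,
       CASE WHEN MAX(bs.backup_finish_date) IS NULL
              OR MAX(bs.backup_finish_date) < DATEADD(DAY, -7, GETDATE())
            THEN 'AT RISK' ELSE 'OK' END AS backup_status
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.database_name = d.name
   AND bs.type = 'D'                 -- full backups only
WHERE d.name <> 'tempdb'
  AND d.state_desc = 'ONLINE'
GROUP BY d.name
ORDER BY last_full_backup;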

From Confusion to Clarity

Every year we get a few of those frustrating calls. A company hit by ransomware, corruption, or human error. The backups weren’t really backups, and the story turns dark fast. We always help, but we can’t always provide the exact help they needed. If we had met them even just a day earlier, theirs would be a very different story.

That’s why I wrote this. So maybe this time, we meet a day earlier. Maybe this time you cover the basics on your own before we have to meet “like that.”

Here’s how you can get started:

  1. Get sp_CheckBackup – no email, no catch. That link takes you right to the GitHub repo – the pages linked above explain it more.
  2. Run it on your SQL Servers.
  3. Review the results and follow the links to understand what you find.
  4. Fix what you can. And if what you find means you can’t recover your SQL Server databases, that’s your P1 task – solve it before anything else this week.
  5. If you’re still stuck, reach out. We offer DBA as a Service help, buckets of hours, or a short four-hour, screen-share-only option where we can fix your challenges and mentor you on making sure they don’t come back.

It’s far easier to remove the confusion today than it is to explain to your boss tomorrow how the data is gone forever.

Confusion isn’t coverage. Run the tool. Know for sure.

Article by Mike Walsh
Mike loves mentoring clients on the right Systems or High Availability architectures because he enjoys those lightbulb moments and loves watching the right design and setup come together for a client. He loves the architecture talks about the cloud - and he's enjoying building a Managed SQL Server DBA practice that is growing while maintaining values and culture. He started Straight Path in 2010 when he decided that after over a decade working with SQL Server in various roles, it was time to try and take his experience, passion, and knowledge to help clients of all shapes and sizes. Mike is a husband, and father to four great children and lives in the middle of nowhere NH.
