SQLU DBA Week – You Can Restore It! (right?)

SQLU DBA Week – You Can Restore It! (right?)

Welcome to my first contribution to SQL University’s Database Administration Week! I am going to go back to some basics here this week. I know you’ve probably been with us for a few semesters and you don’t want to hear that. That’s fine, Microsoft Certified Master Robert Davis is also covering DBA week and he may attack it from a deeper level. I want to hit some posts for those who may have cut classes in earlier semesters. Some of these posts go along with a presentation I enjoy doing (As A DBA Where Do I Start?!) and that presentation post has some good related links. Stay tuned to my feed to see what else we cover this week on this journey of baby steps. Enough introduction. Let’s get into it with our first post and a reminder that –

You’re Only As Good As Your Ability To Restore

I can’t say you’re only as good as your last backup – that would be a lie. I can’t even say you’re only as good as your last tested backup, another lie. You’re only as good as your ability to restore your database to the expectations of your customers. That means we’ll cover a few simple sounding (though sadly rarely practiced) concepts to make sure you are ready to keep your job when that database goes south. It actually starts before the technology in asking –

What Do Your Customers Expect?

Ouch. That means we have to talk.to.the.business… Sounds scary, depending on your experience until now. Now you’re a DBA, though. Part of this means talking to all sorts of strange people (Customers, v-v-v-v-v-vendors, sysadmins and even developers!) and working together with them to get the job done. When it comes to restoring  (I talk about restores instead of backups.. I think most people focus too much on their backups) you have to keep some measurements in mind to keep these customers happy, right? That means we need to be thinking about terms like –

  • Recovery Point Objective  (RPO) – Simply,  How much data are you willing to lose? To what point do your customers expect you to be able to recover your databases? That is your RPO. If you are taking a nightly full backup at 6:00PM then your recovery point ability is the previous night at 6:00PM if your failure happens the next day at 5:30PM. Does this meet what the business is expecting for an RPO? (Hint: you have to talk to them to find out!)
  • Recovery Time Objective (RTO) – How long will it take you to recover to that point we just talked about? How long should it take you? That is your Recovery Time Objective. Your business will have an idea of how long they can be without something in a disaster or need for backup situation. It sure would help to know it, right? This can definitely differ by business area like the RPO.
  • Service Level Agreements (SLA) – These don’t just revolve around restores and disaster recovery. There can be SLAs in place that define response time and performance characteristics but there should be SLAs for recovery. These are agreements between the IT professional and our customer on what we will deliver.

Those terms should suffice for this discussion. Once you know the terms, let’s put them together… Do you have SLAs today around recovery? Do you know what your business expects for RPO/RTO values? Can You meet them? If you answered yes to all of those questions and you really can answer yes to the last question then stop reading and go play a flash game. If you are a bit unsure continue on while we talk about a few steps to delivering recoverability –

  1. Talk to the business – Ask them what they expect in the event of a corrupt database or where-less update statement ruining data. How long do they expect you to shake uncontrollably in your cube while the CIO sighs angrily at you? (RTO) How much data do they expect to lose? (RPO)
  2. Analyze & discuss their needs realistically- If you ask an open ended question with no leading you may get the same answer for RPO and RTO – “milliseconds” That’s fine – do some research on how to accomplish those unrealistic goals and give them an equally unrealistic budget. Much like in a conversation about how many 9’s of availability they want, they’ll probably be willing to budge a little here when they see the dollar signs to deliver an ideal world.
  3. Discover what a reasonable number is – Balancing things like budget, application needs and available resources figure out what you can meet. Really understand SQL recovery models (that is a link to a books online article about recovery models. Rather than regurgitate it for you, go and really understand them. Click through the links. Come back and ask follow up questions here.) Write up what you you can meet and what you think you are haring the business say and deliver them an SLA.
  4. Obtain signoff – That is important. It means they claim they read the SLA. They understood your assumptions, risks, metrics, etc. They understand how much data they could lose, much you are saying they can’t lose and how long it will take you to get the database back up.
  5. Implement it – Setup the strategy and you are done, pat yourself on the back.

Woops…

Okay, I forgot a step. Remember what I said way back at the beginning? You’re only as good as your ability to restore to the expectations of your customer. So we did steps 1-5. If we did them right we should be good. But we all know that bold and italics together mean we’ve assumed wrong, right?

Right. We need to make sure our backups and strategies are useful… Some things to keep in mind here to help this goal –

  • Testing 1…2…3  – We must ensure our recovery strategy is good, otherwise our backup strategy is totally useless and our SLAs are a contract waiting to be broken. We should be verifying our backups as we take them, absolutely. We also must test our restores. Simple tests of one full backup restoring someplace are good. We should also test to make sure our strategy works and we are comfortable and have the muscle memory necessary to restore in an emergency.
  • Don’t Assume – I recently read a neat blog post about a DBA hero who was able to piece together a corrupt database file through writing a script to find and fix the bad parts of the text from an older file, previously knowing nothing about the structure of the contents of data files. Pretty cool, right? Well the tough part was this person didn’t realize there weren’t any good recent backups. They were someone else’s responsibility… Woops! Moral? The buck stops here. As a DBA if we can’t guarantee our ability to restore we should pack up our stuff and find a new job. Seriously.
  • Don’t delete before you take – How are your jobs setup? Do you first delete your old backups and then take a new backup? I’ve heard of a few cases where people have done this only to realize (in one case many many months too late) that they were deleting their backups fine but not taking good new ones. Why don’t you make sure a backup is good before deleting an old one? Change the order.
  • Alert! Alert! Alert! – It helps to know when a backup fails. There is no reason today to not either have a cheap homegrown solution (backup job alerts through SQL Agent, daily log file checks, verifying backup files exist, etc.) or an easily affordable SQL monitoring tool in to tell you when a backup fails. Again we need to spot check our restore ability but a good first step is at least making sure we have backups to test with, no?

Homework?!

“This is SQL University. It’s Free. It isn’t even accredited!  You can’t give us homework, Mr. Mike!” I am going to anyway. If you’re somehow a DBA you should have a recovery SLA of some degree. You should know that you can restore in the event of a disaster. So the homework is simple. It is enjoyable and the job you save through doing it could be your own.

  1. Analyze your recovery strategy. Make sure it agrees with the business’ ideas. Fix it if doesn’t.
  2. Monitor and react for backup misses (job notifications, time set aside to read the logs daily and report, installation of a monitoring tool, etc.)
  3. Test your restores – Take your latest backup and try and restore it. Does it work?

Extra Credit

  1. Create an SLA and get it signed
  2. Test a complex restore (point in time restore if you are taking log backups, for example)
  3. Document your recovery strategy with step by step instructions

Want to see what else we’ll talk about this week? Subscribe to my feed. Want to see what Robert Davis will talk about? Subscribe to his!

Subscribe for Updates

Name

Share This