Have you lost your mind?!
What could I possibly mean by that title? I’m a DBA… One of the basic tenets of a DBA is “backup, backup, backup”, right? Well, sort of…
Far too often I hear people asking questions like “will this setup allow our backups to be fast?”, “how is your backup process working?” or “how can we improve the performance of our backups?”. All great questions, but I propose they are the wrong questions. We should be concerned with our ability to restore first and foremost. Database administrators should be concerned with the speed of our recovery, the reliability of our recovery and our familiarity with recovery.
You’re just playing semantics!
Am I? I have a list of pet peeves a mile long when it comes to backups, and that list is fed by experience (my own and what I’ve heard about or witnessed). It is generally true that if your backups are fast, your restores should be fast as well. It is generally a good thought process that if you are worrying about your backups, you are worrying about your restores. Even still, I like to think of the purpose of the backup first: it’s to recover from a failure of some sort. Think about it in terms of restore and you’ll think of that feeling in your stomach when everything is down, the CIO is in your cube and the phone won’t stop ringing with requests for ETAs. It’s a “pre-restore” that you are doing when you issue the BACKUP DATABASE… command. Thinking about it from the restore standpoint has given me a paranoid sense of how I handle my backups and how I practice my restores.
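To make the “pre-restore” idea concrete, here is a minimal T-SQL sketch. The database name [Sales], the backup path and the logical file names are all assumptions for illustration; adjust them for your own environment.

```sql
-- The backup is the "pre-restore": it only has value if you can restore it.
BACKUP DATABASE [Sales]
    TO DISK = N'D:\Backups\Sales_Full.bak'
    WITH INIT, STATS = 10;

-- Rehearse the restore side before you need it. Restoring under a copy name
-- leaves the original database untouched.
RESTORE DATABASE [Sales_RestoreTest]
    FROM DISK = N'D:\Backups\Sales_Full.bak'
    WITH MOVE N'Sales'     TO N'D:\Data\Sales_RestoreTest.mdf',
         MOVE N'Sales_log' TO N'D:\Data\Sales_RestoreTest_log.ldf',
         STATS = 10;
```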
Did you say practice restores? You don’t actually do that, do you?
Yes, I do. There are a few best practices I try to work with when it comes to restorability, and they may make me sound paranoid, but they help keep me protected. If you study airline or other disasters, you will see a common theme emerge: an apathetic attitude about a number of small things. An attitude of “well, that one screw doesn’t really need to be perfect.” And they are right, that one screw being loose may not be so bad, but what if another person forgot another piece, and the pilot skipped a small check, and so on? Suddenly you have a series of “little things” turning into one big thing. I try to always remember that and be prepared for the worst, protecting myself against the unlikely but possible. Practicing your restores allows you to see where you could fail before the CIO sees you fail; it allows you to understand the steps and develop a checklist; it allows you to tweak your process and see if your restore time is acceptable.
Some of my “Best Practices” for recovery
I used quotes because I think the term is overused; these work for me, and hopefully they work for you also. I fear that saying “best practices” makes me sound like I am preaching and discounting any other practice. I am sure I will omit some, so I welcome any feedback. The point is, we are thinking about the what-ifs, and that is better than living in ignorant bliss until something fails.
- Understand the needs of the business – You are a DBA; you can help shape a recovery strategy and explain the pros and cons of each approach, but don’t make the decision for the business. You’d hate to be wrong about something like this.
- Perform Backups – So you know the needs; now get on it and make sure your backups meet those recovery needs and allow no more data loss than the business can tolerate!
- Practice your restores – Drill, baby, drill. Schools are required to hold emergency drills several times a year. This builds muscle memory in the kids and teachers; they know the way out cold and the steps to take cold. In the horrible event that an emergency does happen at a school, they will be far more prepared than if they had never practiced. Do the same with your recovery:
- Set up a recurring time to practice a restore or series of restores
- Make it a game: run through the scenarios that could happen (read the newsgroups or forums for examples). Practice those scenarios and be sure you can handle them
- Involve others. We all have dependencies on others. Get your organization to practice together, and more frequently than once a year.
- Time your drills
- Write down the steps you took; if something didn’t work, investigate and drill again to verify you fixed it
- Learn from the drill and constantly seek to improve the ability to use a skill you hope to never need.
- Understand Recovery Models – Full, Simple, Bulk-Logged. What do they mean? Look into them. This chart from SQL Server 2000 does a good job of explaining them; I’ll post more on these in the future. Understand them and use the right one for your environment. If you are in Full, practice a point-in-time restore and practice applying several log files (a sketch of such a drill appears after this list).
- Where possible, back up to disk and then move the file – This doesn’t work for everyone, but I generally like to back up to a disk on that server first and have that backup moved to a tape server or Virtual Tape Library immediately or after a delay (see the sketch after this list). Some will disagree, and this has a cost, but for most of the environments I work in it offers a few benefits:
- In the event of a failure that is not a complete server crash you can quickly access the backup locally – most of the restores I have ever had to do are from oopses rather than complete server meltdowns.
- In the event of an oops, you can quickly restore that file as a different database and change just the data you need or troubleshoot.
- One additional piece of redundancy built in
- Look at using filegroups and partial restores – With the advent of piecemeal restores in SQL Server 2005 (see this document) you can use filegroups as a recovery mechanism. You can at times restore just a filegroup, and if you lay out your filegroups carefully (system objects in the primary filegroup, other objects in other filegroups, for example) you can potentially make restoring a very large database a much faster operation. You can also do some of these operations online (a piecemeal restore sketch appears after this list).
- Look at reliability options – Investigate verifying your backups and using checksums. This Books Online article helps explain the options. They can add time to a backup but can give you peace of mind as well (the backup-to-disk sketch after this list includes CHECKSUM and RESTORE VERIFYONLY).
- Document your process – Write down what you will do when things go south. Have a plan to follow. Ensure others on your team know the process and where to find the doc.
- Perform Maintenance – Are you running DBCC CHECKDB on a regular basis? Are you keeping your indexes defragmented? Are you cleaning up your backup history? Do all you can to keep things running optimally and prevent the need for a restore (a few example maintenance commands follow this list).
- Monitor your instances and error logs – If you aren’t routinely checking your systems, using a monitoring tool or process to find and report issues, you are inviting a higher likelihood of needing a restore. More time in proactive mode = less time in reactive mode. Try it 🙂
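Here is a minimal sketch of the point-in-time restore drill mentioned above, under the full recovery model. The database name, backup file names and the STOPAT time are hypothetical; substitute your own, and run the drill against a test copy rather than production.

```sql
-- Restore the most recent full backup, leaving the database ready for log restores.
RESTORE DATABASE [Sales]
    FROM DISK = N'D:\Backups\Sales_Full.bak'
    WITH NORECOVERY, REPLACE;

-- Apply the log backups in sequence.
RESTORE LOG [Sales]
    FROM DISK = N'D:\Backups\Sales_Log_1.trn'
    WITH NORECOVERY;

-- Stop just before the "oops" and bring the database online.
RESTORE LOG [Sales]
    FROM DISK = N'D:\Backups\Sales_Log_2.trn'
    WITH STOPAT = '2009-04-01 14:55:00', RECOVERY;
```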
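And a sketch of the back-up-to-local-disk-then-sweep approach combined with the reliability options above. Paths and names are examples only; the sweep to tape or a VTL would be handled by whatever process your environment already uses.

```sql
-- Back up to local disk first, validating page checksums as the backup is written.
BACKUP DATABASE [Sales]
    TO DISK = N'D:\Backups\Sales_Full.bak'
    WITH INIT, CHECKSUM, STATS = 10;

-- Confirm the backup file is readable and its checksums are intact
-- (this is not a substitute for an actual practice restore).
RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\Sales_Full.bak'
    WITH CHECKSUM;
```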
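Next, a sketch of a piecemeal (partial) restore, assuming a hypothetical layout with system objects in PRIMARY, hot data in [FG_Current] and history in [FG_Archive]; filegroup and file names are illustrative only.

```sql
-- Bring the critical filegroups online first.
RESTORE DATABASE [Sales]
    FILEGROUP = 'PRIMARY', FILEGROUP = 'FG_Current'
    FROM DISK = N'D:\Backups\Sales_Partial.bak'
    WITH PARTIAL, NORECOVERY;

RESTORE DATABASE [Sales] WITH RECOVERY;
-- PRIMARY and FG_Current are now usable; FG_Archive remains offline.

-- Restore the remaining filegroup later, in its own restore sequence
-- (keeping the rest of the database online during this requires Enterprise Edition).
RESTORE DATABASE [Sales]
    FILEGROUP = 'FG_Archive'
    FROM DISK = N'D:\Backups\Sales_FG_Archive.bak'
    WITH NORECOVERY;

-- Under the full recovery model, apply the subsequent log backups here
-- so FG_Archive is consistent with the rest of the database, then:
RESTORE DATABASE [Sales] WITH RECOVERY;
```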
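Finally, a few of the maintenance commands referenced above, again with hypothetical object names and retention dates:

```sql
-- Check for corruption before it shows up during a restore.
DBCC CHECKDB (N'Sales') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Defragment the indexes on a heavily used table (table name is a placeholder).
ALTER INDEX ALL ON dbo.Orders REBUILD;

-- Trim old backup/restore history in msdb so it does not grow unbounded;
-- keep at least as much history as your retention policy requires.
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = '2009-01-01';
```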
What are some of your best practices? What are some of your horror stories about backups?
Well said. This attitude exists because those asking the questions you cite are more interested in placing a check mark next to a bunch of words on an obscure document (a.k.a. "Best Practices"). I’ve been a facilitator for emergency preparedness meetings (for water utilities) and it’s something that is worthwhile for DBAs (and, really, anyone) to think about in their own jobs. Practicing is important no matter what your job is – especially when part of it involves emergency response in one form or another.
Thanks for the comment, Jon. Indeed practicing is important in every field. My favorite armed services motto is the Coast Guard’s: "Semper Paratus" or "always ready". You need to be ready for the unexpected. Even if you can’t practice everything, just think about scenarios and responses in a variety of situations. Run through them in your head and make sure you understand your approach.
Your manager will appreciate the effort and feel more confident.
I blogged about this back in September last year and have spoken about it often at user groups. I think the real problem is that people are naturally lazy and that software is not designed around recovery. You never hear of a “recovery product” – it’s not NetRecovery, it’s NetBackup – so some of the blame also lies with the way vendors sell it. http://www.sqlbloke.com/sql-bloke-blogs/2008/9/19/testing-backups.html
Thanks Bryan –
Just checked out your blog. Excellent stuff, and I liked your point about documentation/run books. I missed that one, but it’s an important piece.
As far as vendors getting some of the blame, sure, I can see that. I think it’s primarily apathy/laziness, as you said, but also people being more worried about budgets, performance (on the backup side), and encryption/security of backups, with not enough focus on the restore side.
It’s our culture. We don’t worry about the what-ifs because they don’t happen a lot. As for me and my house, I’ll try to be ready for the worst and hope for the best.