Don’t Splint Your Database Server To Death

A trauma patient can be “splinted to death.” So can a database server. It happens on at least one ambulance call each year, and I’m sure it happens in many more data and network operations centers each year, too.

This post is my attempt to restart the “Lessons From Disasters” series I promised a while back. I’m intrigued by disaster preparedness and I have an interest in learning from history. We’ll take a look at some real disasters and pull out the lessons we can learn from them as IT professionals. We’ll explore fields that deal with disasters for a living and learn how to apply their training principles to our day jobs, and hopefully we’ll find some “everything done right” examples to aspire to. I’m going to work on having one of these posts out each week – on Thursdays.

Splinted to Death, Really?

Yeah. I still remember the EMT course I was in when I first heard the concept – it made sense immediately. In fact, it is beaten into your head throughout Emergency Medical Services (EMS) trainings and refresher courses in the form of the ABCs (well, now the CABs) and the proper order of patient assessment. Most of the training and education standards in EMS come from data – really rich data collected and analyzed by the states and the National Highway Traffic Safety Administration (the folks who maintain EMS standards in the US). They see what works and what doesn’t. They analyze the documentation done in the pre-hospital setting and then track the progress of patients through the system. One of the things they noted is that folks in EMS were splinting patients to death. It still happens, but less often now that the focus is on proper assessment.

I keep using this phrase – what do I mean? Imagine this scene: you are on an ambulance crew and you arrive at the scene of a car accident. You see that there is one patient, and they are walking around screaming, “My arm! My arm! Help!!! Please, help, my arm!!!” You look down at the arm they are holding and instantly realize it isn’t set up the way it’s supposed to be. Obvious deformities, clearly broken – probably in multiple places. Every move the patient makes sends them into bone-chilling, nausea-inducing, scream-worthy pain. It needs to be stabilized. The patient was up and walking – a great indication they have a pulse. They were able to scream – a good indication they can breathe (contrary to what a panic-attack patient believes as they tell you over and over again “I can’t breathe” while hyperventilating at 50 respirations/minute). You are there to help and you can almost imagine their pain, so it makes sense to start splinting that arm and treating the symptoms… Have the patient lie down on the stretcher and get to work on the splinting. It’s going to be a tricky one to stabilize just right and it hurts every time you touch it, so be careful. Eventually, you’ll have a well-splinted arm on a patient who has really calmed down as a result.

But you forgot to check vitals… You forgot to listen to lung sounds… The patient is calm because the patient is on the downward slope of shock – decompensated shock. Their body has given up trying to compensate for the issues affecting circulation or cellular respiration. They are just about out of it; you can’t even feel a pulse at the wrists anymore because the blood pressure can’t get it there, and you can barely feel it at the neck… You focused on that obvious, ugly, painful arm and ignored everything important. You missed the internal bleeding in the chest pressing against a lung. You missed a cruddy blood pressure and a rapid, weak heartbeat. You missed that they were already trending towards bad on the “good/bad scale”… Maybe you catch all of this in time, and maybe you don’t. Maybe the patient survives another couple of weeks and then dies of organ failure, maybe they walk away fine, maybe they don’t leave the emergency room alive. You’ve splinted your patient to death.

But We’re IT Professionals

Right. Right. I was getting carried away. We “splint” our database environments to “death” all the time! With no NHTSA governing how we are trained and how we operate, and with no databases tracking outcomes, we probably do it a lot more often than you see in EMS with human patients. I know I’ve blogged quite a lot about troubleshooting, and many of these points even sound like a SQL Server Central article I wrote a few years back, but I’m still bumping into this phenomenon. So this does apply to us, and I think the EMS training and its tips for dealing with the risk apply to us, too.

The Solution

The same solution in EMS actually works here:

  • ABCs – Spend time on the basics necessary for life before moving on. Power on? Services up? Machine pingable? The point is – the ABCs have to be the things you can’t do without. Minor bleeding will stop on its own, but none of that matters if you don’t fix the airway. You can restart the app server and reboot the client as many times as you like – if the SQL Server service won’t start, you’ll never connect to it. (A quick sketch of this check follows the list.)
  • Rapid (Primary) Assessment – Once you check for and fix any ABC deficits (clear airway, adequate ventilation, beating heart, and immediately life-threatening bleeding), do a rough assessment of the patient. Note anything found and fix anything serious. Anything critical in the error logs? Blocking chains? Nightly job running crazy? Someone change the app’s config file? Can you log in yourself? The point is – find and treat any remaining threats to life or limb. Don’t get caught up buddy-taping fingers – do take the time to splint an unstable pelvis that could damage critical arteries. Find and fix the things related to, or likely related to, the outage. Note the things you find that may not be optimal but aren’t causing this issue. The CIO wants those other things fixed too, but they want the server up now. (A sketch of the error log and blocking checks follows the list.)
  • Make Transport Decision – You’ve done the basic checks and you should know by now what priority this patient is. Need a helicopter? Closest hospital or trauma center? Do you need a crew with a higher license level to meet you en route? Are we looking at an extended downtime? Can we try a few more quick things based on the info we have? Should we go live in the DR site because we need more troubleshooting time? Do I call in vendor support? Do I wake up the dev team? The point is – you’ve done some quick checks and quick fixes – what are you going to do next? Chat about sports and wash cuts, hand out ice packs, or get hauling with a critical patient – do you need more resources? Don’t get lost in that weird limbo state of “I don’t know what is going on or what I’m going to do next” – move the situation towards resolution. If you need to call in help, no one cares – they just want the system up. If you have to fail over – do it and move on. (A sketch of a DR failover follows the list.)
  • Secondary Assessment – So you’re heading to the hospital and you have a bit more time – go back and look in more detail at the findings you noted during the rapid assessment. Fix those broken fingers, clean wounds, look for other hidden injuries. Fix that max memory setting so this issue doesn’t happen again, create the task to take Domain\Staff out of the sysadmin role, and look for the tertiary issues related to the outage and resolve them. (A sketch of those two fixes follows the list.) The point is – you want to double-check your findings, look for additional issues, and continue to make improvements (or prepare the hospital to make them). Go deeper where something felt “off” but you didn’t see an obvious, quick issue. Remember – the secondary assessment only begins after the first three steps. You have your system coming back to life and users can start to get into prod or DR; if not, you are still fixing ABCs and primary items. There are some ambulance calls where the entire call and drive never leave the ABCs because you are fighting an airway the whole way in.
  • Reassess/Monitor Vitals – How are your treatments working? Are the patient’s vital signs trending stable or going downhill? Are connections staying up? Can that first batch of users you let in get their job done? Are processes running and staying running? The point is – you need to monitor your interventions. You want to make sure things aren’t getting worse again, and if they are, you need to act. (A monitoring sketch follows the list.)
  • Cleanup, Documentation, Etc – The next call doesn’t work if the ambulance isn’t restocked, organized and ready. The EMS system never gets better if the calls aren’t documented. That patient’s medical team misses an important fact about the call if you didn’t document it. Lessons learned meetings (not blamestorm meetings) help us avoid the problem and improve the troubleshooting approach for next time.
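
Since this is a SQL Server blog, here is a minimal T-SQL sketch of the “ABC” check from the first bullet – assuming you can reach the instance at all. If even this won’t run, you’re still at the “airway” stage and working at the service, network, or host level.

```sql
-- Does the patient have a pulse? Instance name, when it last started,
-- and any databases that are not ONLINE.
SELECT @@SERVERNAME         AS instance_name,
       sqlserver_start_time AS instance_started
FROM   sys.dm_os_sys_info;

SELECT name, state_desc
FROM   sys.databases
WHERE  state_desc <> N'ONLINE';   -- RECOVERING, SUSPECT, RESTORING, OFFLINE...
```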
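
For the rapid assessment, a sketch of the “anything critical?” pass – recent errors in the log and the head of any blocking chain. This is deliberately minimal; your own diagnostic scripts will go deeper.

```sql
-- Recent SQL Server error log entries containing "error".
EXEC sys.xp_readerrorlog 0, 1, N'error';

-- Blocked requests right now, and who is blocking them.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS running_sql
FROM   sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE  r.blocking_session_id <> 0;
```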
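
If the transport decision is to go live in the DR site and you happen to be on Availability Groups, the command itself is short – the hard part is deciding. A hedged sketch, run on the DR replica, with AgProd as a made-up group name (log shipping or mirroring setups will look different):

```sql
-- Planned failover (no data loss) when the replica is SYNCHRONIZED:
ALTER AVAILABILITY GROUP [AgProd] FAILOVER;

-- Forced failover when the primary is gone. This accepts possible data
-- loss, so it is a business decision as much as a technical one:
-- ALTER AVAILABILITY GROUP [AgProd] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```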
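
The secondary assessment bullet names two concrete follow-up fixes; here is a sketch of both, with placeholder values you would size and name for your own environment (24576 MB is just an example, and ALTER SERVER ROLE requires SQL Server 2012 or later):

```sql
-- Cap max server memory so the OS stops getting starved.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 24576;
RECONFIGURE;

-- Get the overly broad group out of sysadmin (the follow-up task from the outage).
ALTER SERVER ROLE [sysadmin] DROP MEMBER [Domain\Staff];
```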
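
And for reassessing, a simple way to watch whether users are actually getting in and staying in (SQL Server 2012 or later for the database_id column) – run it a few times and watch the trend, not the single number:

```sql
-- User sessions and active requests by database.
SELECT DB_NAME(s.database_id) AS database_name,
       COUNT(*)               AS sessions,
       SUM(CASE WHEN r.session_id IS NOT NULL THEN 1 ELSE 0 END) AS active_requests
FROM   sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_requests r ON r.session_id = s.session_id
WHERE  s.is_user_process = 1
GROUP BY DB_NAME(s.database_id)
ORDER BY sessions DESC;
```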

Related Posts

  • If You See Something, Say Something – If something is wrong and you know it – say something! Don’t assume everyone else already knows.
  • Avoid Using Those Troubleshooting Skills – Acquiring troubleshooting skills is an important endeavor. But what if you handled your environments in such a way that you needed them less and less?
  • Best Practices: Explain and Understand Them! – I hate it when folks say things like “this is best because I feel…” or “I’m not sure why, but just always do this!” No one will follow your best practices if you don’t explain them!
  • Are You Planting Asparagus? – Asparagus can’t be picked for the first couple of years after you plant it. It still takes preparation and hard work. Are you making decisions with the long term in mind? You’ll be less likely to find yourself in a situation where you have to splint your database.

7 thoughts on “Don’t Splint Your Database Server To Death”

  1. Great post Mike! I too remember well those ABCs and how many times it kept me calm in the midst of chaos. When all hell’s breaking loose in the data center the focus on simple, regimented steps certainly helps too. I love the comparisons to the various stages of treatment. Look forward to the rest of the series!

    • Hey Josh!

      Haven’t chatted in a while. I figured you’d like this post, having experience splinting both people and databases from your EMS days. Great, great point there where you said, “When all hell’s breaking loose in the data center the focus on simple, regimented steps certainly helps too.”

      Really well said! Thanks for the comment.

  2. Great analogy. I remember that deer-in-the-headlights limbo a couple of times from early in my career. As geeks, sometimes the hardest thing can be admitting “I don’t know” and then calling in some help.

    • That’s right, Mike. It is tough to cross that hurdle! I was going to say the first time… but it can still be tough to cross it the Nth time as well. We think it looks like weakness, but I definitely think it is a strength to call in help when needed. In EMS, if you don’t “escalate” to a higher license when needed, someone’s life is on the line and you can be liable for negligence. In our line of work, a database server dies.

