Keeping with the spirit of ripping lessons from headlines, like last week's communication-skills lesson drawn from a police shooting, I think there are some lessons we DBAs can learn from this FAA fire. All of the facts aren't in yet, but even if the story doesn't pan out exactly as early reporting suggests, the lessons are still valid.
Recovery and Availability Reminders
HA != DR
High Availability and Disaster Recovery are different. HA is what you need when you can't stomach much, if any, downtime after a failure: systems need to be back up almost immediately, or never go down at all. DR is for a really bad situation; in many cases you invoke it when a problem is beyond the reach of your HA setup, such as a server room melting down or a city completely losing power and connectivity. DR is also, in my mind, a mindset that every shop must have: think about failures and how you will respond.
To the FAA's credit, this was a fire at a single site. For most of my customers that is DR territory, not HA. But for some of my customers? If they lose a data center, operations need to keep processing as normal under peak load. I would expect the same of the FAA.
In your environment you can have HA or not, based on your business needs. DR is something you should have figured out to some degree, whether that is a cold standby someplace you can restore backups to, an asynchronous Availability Group replica somewhere else, or simply knowing how to get to your offsite backups and how long it will take Dell or HP to ship you replacement equipment. Stop confusing the terms, and work with the business to figure out what sort of recovery you need.
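As a rough illustration of that conversation with the business, here is a back-of-the-envelope sketch in Python of checking whether restoring offsite backups to a cold standby could meet an agreed Recovery Time Objective. Every number here is a made-up example, not a measurement from any real system:

```python
# Hypothetical DR feasibility check: can "restore offsite backups to a
# cold standby" meet the business's Recovery Time Objective (RTO)?

def restore_time_hours(backup_gb: float, link_mbps: float,
                       restore_rate_gb_per_hour: float) -> float:
    """Estimate hours to copy a backup to the standby and restore it."""
    # Convert GB to megabits, divide by link speed, convert seconds to hours.
    transfer_hours = (backup_gb * 8 * 1024) / (link_mbps * 3600)
    restore_hours = backup_gb / restore_rate_gb_per_hour
    return transfer_hours + restore_hours

# Example: 500 GB backup, 200 Mbps link, restoring at 250 GB/hour,
# against an assumed business-agreed RTO of 8 hours.
total = restore_time_hours(500, 200, 250)
rto_hours = 8
print(f"Estimated recovery: {total:.1f} h; meets RTO: {total <= rto_hours}")
# -> Estimated recovery: 7.7 h; meets RTO: True
```

Swap in your own backup sizes, link speeds, and restore rates; if the estimate blows past the RTO, that is the business case for a warmer standby.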
I like to build HA solutions (Failover Cluster Instances or AlwaysOn Availability Groups) with redundancies everywhere: redundant switches in separate racks, cross-connected to independent physical NICs on servers in separate racks, with multiple power supplies each fed from different power distribution units, and so on. Basically? I want my clients to invest in their HA solutions as if even a quick, mostly seamless local "simple failover" is a rare event they want to avoid.
The CNN report linked above says:
After the fire, air traffic controllers initially had to manually transfer flight data that normally would be communicated by computer, said Doug Church, spokesman for the National Air Traffic Controllers Association.
Church said the fire damaged the telecom line that transferred flight plans from the airlines to the O’Hare control tower and then to the Aurora control center.
This is an early news story, possibly with inaccuracies, so this isn't an attack on the FAA. But two things popped out at me in that quote that make good reminders of what I'm getting at here regardless:
- "the telecom line" – now this is filing flight plans for flights that are not yet in the air, I imagine, so it isn't a life-safety issue. But hearing "the" when talking about a really important link in any system is concerning. If you can describe any one important component in any part of your system using the word "the" and singular nouns, you may have yourself a single point of failure. That's not necessarily evil, but you should make sure it's what you expected.
- "the fire damaged…" – so this single point of failure was damaged in a fire set by an employee who apparently lugged a gas can into the basement of the really busy, critical national-infrastructure facility where he worked. Physical security is critical; physical fireproofing and protection are critical. Have you been in your server rooms? Have you gone to your colo? Have you interviewed the people who babysit your hardware?
Role Play Scenarios
Another quote in that CNN article talks about the airlines faxing flight plans over when that one phone line got a little melted. This worked, though it meant controllers had to double up and air traffic had to plummet to accommodate the slowdowns. In your environment? Think about and talk through the what-if scenarios. "What if we fail over to colo B, but our SharePoint system isn't deemed DR-worthy, and it contains our phone tree and all of our recovery procedures?" Ask a lot of what-ifs and role play through failures. And if you have a step that sounds something like "Department A will just fax these things to Department B and stuff will work like normal," stop and ask if that makes sense. Maybe in a small office it does. For Amazon.com orders? Not so much. Poke holes in your solutions and see what happens.
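One way to make those what-if sessions concrete is to write down which systems your recovery procedures depend on and check the list mechanically. This is a hypothetical sketch; the system names and the dependency graph are invented for illustration:

```python
# "What if" dependency audit: walk the dependency graph and flag anything
# a critical system relies on that the DR plan does not itself cover.
# All system names and dependencies below are hypothetical examples.

depends_on = {
    "order-processing": ["database", "sharepoint-runbooks"],
    "database": [],
    "sharepoint-runbooks": [],  # holds the phone tree and recovery procedures
}
dr_covered = {"order-processing", "database"}  # what the plan replicates

def uncovered_dependencies(system: str) -> set:
    """Return everything reachable from `system` that DR does not cover."""
    seen, stack, gaps = set(), [system], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node not in dr_covered:
            gaps.add(node)
        stack.extend(depends_on.get(node, []))
    return gaps

print(uncovered_dependencies("order-processing"))  # {'sharepoint-runbooks'}
```

In this invented example, the SharePoint site with the recovery runbooks is exactly the kind of gap a role-play session should surface before the real disaster does.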
I don't know if they tested the scenario of Chicago Center going down, and the one phone line connecting ZAU to the airlines going down with it. Maybe they did; maybe not. Either they did an amazing job because they trained, and without that training they would have had 5,000 cancellations, or they didn't train and could have had fewer. Or they didn't, and someone doing the role play above would have said, "Well, let's plan a redundant line this way, and let's work on our fireproofing plans here and here," instead of blowing dust off of fax machines.
Failing Back Needs Love Also
I have probably spent more billable time helping clients fail back from their DR site after a disaster than helping them in the middle of the actual disaster. Failing back has moving parts, and it requires thinking ahead of time. When you design your solution, before you select the path, think about how you are going to fail back after everything is fixed and happy. Will it require downtime? Will you have the connectivity you need to move VMs, databases, and backups from the DR site back to primary? Have you planned, budgeted, and tested enough to ensure you can get back to production?
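Here is a quick, hypothetical sketch of the kind of arithmetic that belongs in failback planning: estimating how long it takes to copy the data that changed while you ran at the DR site back to the repaired primary, and whether that fits the agreed maintenance window. All figures are assumptions made up for the example:

```python
# Rough failback planning: time to ship the data that changed at the DR
# site back to the primary, versus the agreed downtime window.
# Every number here is a hypothetical assumption, not a real measurement.

def failback_copy_hours(changed_gb: float, link_mbps: float) -> float:
    """Hours to move changed data back over the DR-to-primary link."""
    return (changed_gb * 8 * 1024) / (link_mbps * 3600)

days_on_dr = 14                       # assumed two weeks running at DR
change_rate_gb_per_day = 40           # assumed daily data churn at DR
changed = days_on_dr * change_rate_gb_per_day   # 560 GB to bring back
hours = failback_copy_hours(changed, 100)       # assumed 100 Mbps back-link
window_hours = 6                                # agreed maintenance window
print(f"{changed} GB -> {hours:.1f} h; fits window: {hours <= window_hours}")
# -> 560 GB -> 12.7 h; fits window: False
```

With these made-up numbers the copy doesn't fit the window, which is exactly the kind of surprise you want to discover on paper, not mid-failback.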
Security Is Part of HA/DR

This can't be overstated: risks don't just come from outside your firewall. You can't just say "well, we have a firewall there, so we are fine." Internal threats happen. Amazing employees can have something snap inside them. Well-intentioned employees can leave a door open. Viruses can come in on someone's music collection. A competitor could have someone working for you, waiting for a good moment. Problems can come from within and without. Security is as much a part of HA/DR as anything else, and not just network security but physical security. Some of the breaches in the news have happened because of physical security lapses or employees being socially engineered. Review your security and make sure someone can't literally or figuratively walk in with a can of gasoline and set fire to the single phone line connecting you to important operations.