I almost didn’t get to write this post tonight about a lesson from my farm life for us DBAs. I could have been dealing with a massive headache or a hurt daughter – or worse. But. I get to write it so that you can learn from our near incident…
First A Question
How are you? How is your SQL Server estate doing these days? Can you restore? Do you know if you have corruption? When was your last restore test? How’s your service pack level? What sort of baseline do you have for perf? Free space?
Okay. So it was a few questions.
What Happened Today
Alright. Back to my story. I drove into my barn today to get a round bale of hay, like I usually do every 1-3 days for our cows (Seriously! Grass-Fed Beef means a LOT of grass!!!). I drove out and brought the tractor down to the farm gate so my oldest daughter and I could cut the strings, and I could drop it into a round bale feeder. Only this time, my daughter sort of ran over to me panicking. She had been in the barn basement a moment before (the cows live in the barn basement). Long story short – a carrying beam in the basement snapped, and she noticed a big shake and sway underneath me. The floor joists were now about a half-inch further away from the beam they are supposed to be tied into. Also – the carrying beam has a fairly substantial (as far as beams go) sag to it. I had a builder friend come by and inspect – and it’s not good. A few beams in the general area are finding themselves on the “bad” side of the good:bad scale.
I had noticed a little play in the floorboard before when bringing hay in and out – and it’s a 130-year-old barn, so I’ve always been a bit apprehensive in the fall loading 20 round bales in there (650 pounds average size, though we have 1,000-pounders or 500-pounders at times) and 600 square bales in a loft to the right. That’s about 20 tons, all told, and my tractor weighs about a ton driving the round bales in and out. It’s something I’ve sort of known about and thought about. I’ve even gone down and casually looked at the beams before and thought about maybe someday shoring something up just in case. But it’s been fine until today. The current batch of round bales is not in too deep yet, and today was the first time I crossed that beam in the front part of the barn – and the builder said it was a good thing I wasn’t going deeper. That floor may have given out entirely – and depending on how it all settled – the outside walls and metal roof and loft may have been okay or may not have been (I drive in the middle of the barn; the outside walls and loft spans are grabbing posts to the left or right of the spot where the crack is). But my daughter was down there staring up at the cracking beam, and I was on the tractor. It would not have ended well for one or both of us.
That was a close call. For the next several weeks or month, I’ll be rolling out the round bales by hand and getting a leg and core workout at the same time. I’ll also be exercising the wallet when I have a few issues fixed.
The Moral of the Story: DBA Edition
How often do we DBAs get so busy moving forward that we forget to check our foundations (or our structural support beams, as it were)?
It’s often far too easy for us to get caught up in one of a couple of bad patterns. We either get caught up in the routine (bringing hay in and out every day, the same way, with the same tools, without really thinking too much about it) or we get stuck on “new and exciting”. It’s easy for us to be killed by meetings. It’s easy to assume the environment is stable and secure and focus on the routine, or the fights with developers or checking out a new feature and trying to find a use case for it.
But as DBAs, we’re only as good as the structure we lay down underneath us, and only as safe as that structure’s ability to hold us up. At Straight Path, we do a lot of health checks for new clients. I’ve lost count of the number of DBs never backed up, the number of missing integrity checks, the lack of maintenance, the out-of-date and out-of-compliance versions of SQL Server, or the crumbling infrastructure that a 7-year-old laptop can beat in an IO performance test.
We must be vigilant. We need to go down into the basement and look at the beams under our database servers. And I’d say that, like my barn, the older it is – the more we should do that. Not for the reasons you think – not because old = bad. In fact, I’d say old oftentimes = better. But it’s because that “stability” we feel can often be false. It’s easy to get lulled into a sense of “well this has served us well all these years, what could possibly go wrong?” and give up on the checks. That’s why we have so many bridges in this country on “red lists…” It’s also why we get more “HELP! We’re up the creek and we don’t know what to do!!!!!” calls than “Hey can you look at this? We want to see if the foundation is solid and if we can get another 5 years out of this.” calls… We assume that because it’s there, it will last. It’s just not true.
So as DBAs – we need to change our attitude. We need to assume that our barn could give way underneath us, and we should regularly check on the things that we should be terrified of. I’m not going to make an exhaustive list right here, right now, but I’m working on a checklist PDF of sorts. Sort of an “Inspect/Maintain/Replace” list. If you own a car, you know the owner’s manual has prescribed maintenance periods. Something to do every 5,000, 10,000, 15,000, 30,000, 100,000 miles, and so on. We need that, too.
If you have some ideas, leave them in the comments, and I’ll work your feedback into the PDF. I’m not going to require an email or anything to download it. I just want it to be a useful tool for us. And I’ll try and publish it in the coming week or two in another post.
For Now – Get Started With Something…
There are some basic “non-negotiable” things you could be looking at regularly at the very least. And if you aren’t doing these things, you may be nearing collapse and not even realize it!
Get started with this homework while I work on my list:
Inspect
On a regular basis inspect:
- Are your backups being done? Each day, do you have a completed backup?
- Are your backups any good? Either use the dbatools PowerShell scripts to do a test restore as part of your backup job (100% compliance, maybe too much for large DBs), or randomly do a restore of a random DB (some compliance, and better than none), or pick some frequency in the middle – and restore a DB. Run a CHECKDB on it. Did the restore work? Any integrity violations? Try a point-in-time restore and test your ability to do it while it’s calm.
- Free Space – how is it trending? Are you in danger? When will you be in danger?
- Performance – What’s your baseline? How is it trending?
- Make sure your jobs succeeded – In the past 24 hours, what failed? Can you automate the checks?
- Any errors? Errors can point out so many things. We’ve caught a virus in a client’s network that was a precursor to a pending ransomware attack: because of the high number of failed logins, we identified the client and the virus, and they stopped the damage. While that’s way too late to discover it, it’s not nearly as bad as discovering it after the ransomware attack killed everything!
- Who has access? Check on perms periodically.
- What is your version/service pack/CU compliance? If you aren’t up to date, and something bad happens, and a CU or SP from 2 or 3 years ago could have fixed it, you’re kind of responsible, right?
- Is Your HA/DR Solution Helping or Hurting? HA/DR is cool! A failover cluster instance or availability group is a neat thing to build. High availability is great! But if you are having unresolved errors in cluster manager – your solution could actually leave you down completely.
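Two of the checks above (did every database get a backup in the last day, and when does free space run out) are easy to sketch in a few lines of Python. This is only an illustration under some assumptions of mine: the function names and the 24-hour threshold are made up for the example, the inputs are plain values you would have to collect yourself (last backup times could come from something like backup_finish_date in msdb.dbo.backupset, free space from whatever you already log daily), and the free space estimate is a simple straight-line fit, which will be wrong if your growth is bursty.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup_by_db, now, max_age=timedelta(hours=24)):
    """Return database names whose newest backup is missing or older than max_age.

    last_backup_by_db maps a database name to the finish time of its most
    recent backup, or None if it has never been backed up (hypothetical input).
    """
    stale = []
    for name, finished in sorted(last_backup_by_db.items()):
        if finished is None or now - finished > max_age:
            stale.append(name)
    return stale


def days_until_full(free_gb_samples):
    """Estimate days left until free space reaches zero.

    free_gb_samples is a list of (day_number, free_gb) pairs. A least-squares
    line is fitted through them; the result is measured from the latest sample.
    Returns None when there is too little data or free space is not shrinking.
    """
    n = len(free_gb_samples)
    if n < 2:
        return None
    mean_day = sum(day for day, _ in free_gb_samples) / n
    mean_free = sum(free for _, free in free_gb_samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in free_gb_samples)
    if spread == 0:
        return None  # every sample was taken on the same day
    slope = sum(
        (day - mean_day) * (free - mean_free) for day, free in free_gb_samples
    ) / spread
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_free - slope * mean_day
    zero_day = -intercept / slope  # day on which the fitted line hits 0 GB
    latest_day = max(day for day, _ in free_gb_samples)
    return max(0.0, zero_day - latest_day)


if __name__ == "__main__":
    now = datetime(2020, 4, 20, 8, 0)
    backups = {
        "Sales": datetime(2020, 4, 20, 1, 0),
        "HR": datetime(2020, 4, 17, 1, 0),
        "Archive": None,
    }
    print("Stale or missing backups:", stale_backups(backups, now))
    print("Days until full:", days_until_full([(0, 100.0), (1, 90.0), (2, 80.0)]))
```

With the sample data, HR and Archive get flagged, and a drive losing 10 GB a day with 80 GB left shows 8 days of runway. The point isn’t the math. The point is that a check like this can run every morning without anyone remembering to go down to the basement.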
I could go on. I won’t. There’s more we can and should be doing. These things aren’t in your calendar. They aren’t in your new deployment meetings. They aren’t in the fun and exciting things. But we need to be vigilant. Entropy is real, and even your SQL Server deployment will trend towards an entropic state if left alone long enough. I’ll write more in a list, and maybe a proposed schedule like your owner’s manual, with a suggested calendar. Stay tuned here next week. I’ll have it by Friday 4/24. And now I wrote the date down, so I have to. Add any must-haves from your own list so I can shamelessly steal from you and add them to the checklist. If it’s one that wouldn’t have been there otherwise, I’ll give you credit, too.