Inspect Early. Inspect Often.

on April 16, 2020

I almost didn’t get to write this post tonight about a lesson from my farm life for us DBAs. I could have been dealing with a massive headache or a hurt daughter – or worse. But. I get to write it so that you can learn from our near incident…

First A Question

How are you? How is your SQL Server estate doing these days? Can you restore? Do you know if you have corruption? When was your last restore test? How’s your service pack level? What sort of baseline do you have for perf? Free space?

Okay. So it was a few questions.

What Happened Today

Alright. Back to my story. I drove into my barn today to get a round bale of hay like I usually do every 1-3 days for our cows (Seriously! Grass-Fed Beef means a LOT of grass!!!). I drove out and brought the tractor down to the farm gate so my oldest daughter and I could cut the strings, and I could drop it into a round bale feeder. Only this time, my daughter sort of ran over to me panicking. She had been in the barn a moment earlier. Long story short – a carrying beam in the basement snapped, and from down there she noticed a big shake and sway underneath me (the cows live in the barn basement). The floor joists were now about a half-inch further away from the beam they are supposed to be tied into. Also, the carrying beam has a fairly substantial (as far as beams go) sag to it. I had a builder friend come by and inspect, and it’s not good. A few beams in the general area are finding themselves on the “bad” side of the good-to-bad scale.

I had noticed a little play in the floorboards before when bringing hay in and out – and it’s a 130-year-old barn, so I’ve always been a bit apprehensive in the fall loading 20 round bales in there (650 pounds average size, though we have 1,000-pounders or 500-pounders at times) and 600 square bales in a loft to the right. That’s about 20 tons, all told, and my tractor weighs about a ton driving the round bales in and out. It’s something I’ve sort of known about and thought about. I’ve even gone down and casually looked at the beams before and thought about maybe someday shoring something up just in case. But it’s been fine until today. The current batch of round bales is not in too deep yet, and today was the first time I crossed that beam in the front part of the barn – and the builder said it was a good thing I wasn’t going deeper. That floor may have given out entirely, and depending on how it all settled, the outside walls and metal roof and loft may have been okay or may not have been (I drive in the middle of the barn; the outside walls and loft spans are grabbing posts to the left or right of the spot where the crack is). But my daughter was down there staring up at the cracking beam, and I was on the tractor. It would not have ended well for one or both of us.

That was a close call. For the next several weeks or a month, I’ll be rolling out the round bales by hand and getting a leg and core workout at the same time. I’ll also be exercising the wallet when I have a few issues fixed.

The Moral of the Story: DBA Edition

How often do we DBAs get so busy moving forward that we forget to check our foundations (or our structural support beams, as it were)?

It’s often far too easy for us to get caught up in one of a couple of bad patterns. We either get caught up in the routine (bringing hay in and out every day, the same way, with the same tools, without really thinking too much about it) or we get stuck on “new and exciting”. It’s easy for us to be killed by meetings. It’s easy to assume the environment is stable and secure and focus on the routine, or the fights with developers or checking out a new feature and trying to find a use case for it.

But as DBAs, we’re only as good as the structure we lay down underneath us. We’re only as safe as those structures are able to hold us up. At Straight Path, we do a lot of health checks for new clients. I’ve lost count of the number of DBs never backed up, the number of missing integrity checks, the lack of maintenance, the out-of-date and out-of-compliance versions of SQL Server, or the crumbling infrastructure that a 7-year-old laptop can beat in an IO performance test.

We must be vigilant. We need to go down into the basement and look at the beams under our database servers. And I’d say that, like my barn, the older it is – the more we should do that. Not for the reasons you think – not because old = bad. In fact, I’d say old oftentimes = better. But it’s because that “stability” we feel can often be false. It’s easy to get lulled into a sense of “well this has served us well all these years, what could possibly go wrong?” and give up on the checks. That’s why we have so many bridges in this country on “red lists…” It’s also why we get more “HELP! We’re up the creek and we don’t know what to do!!!!!” calls than “Hey can you look at this? We want to see if the foundation is solid and if we can get another 5 years out of this.” calls… We assume that because it’s there, it will last. It’s just not true.

So as DBAs – we need to change our attitude. We need to assume that our barn could give way underneath us, and we should regularly check on the things that we should be terrified of. I’m not going to make an exhaustive list right here, right now, but I’m working on a checklist PDF of sorts. Sort of an “Inspect/Maintain/Replace” list. If you own a car, you see the owner’s manual has prescribed maintenance periods. Something to do every 5,000, 10,000, 15,000, 30,000, 100,000 miles, and so on. We need that, too.

If you have some ideas, leave them in the comments, and I’ll work your feedback into the PDF. I’m not going to require an email or anything to download it. I just want it to be a useful tool for us. And I’ll try and publish it in the coming week or two in another post.

For Now – Get Started With Something…

There are some basic “non-negotiable” things you could be looking at regularly at the very least. And if you aren’t doing these things, you may be nearing collapse and not even realize it!

Get started with this homework while I work on my list:

Inspect

On a regular basis inspect:

Are your backups being done? Each day, do you have a completed backup?
Are your backups any good? Either use the DBATools PowerShell scripts to do a test restore as part of your backup job (100% compliance, maybe too much for large DBs), or randomly do a restore of a random DB (some compliance, and better than none), or pick some frequency in the middle – and restore a DB. Run a CheckDB on it. Did the restore work? Any integrity violations? Try a point-in-time restore and test your ability to do it while it’s calm.
Free Space – how is it trending? Are you in danger? When will you be in danger?
Performance – What’s your baseline? How is it trending?
Make sure your jobs succeeded – In the past 24 hours, what failed? Can you automate the checks?
Any errors? Errors can point out so many things. We once caught a virus in a client’s network that was a precursor to a pending ransomware attack. Because of the high number of failed logins, we identified the client and the virus, and they stopped the damage. While that’s way too late to discover it, it’s not nearly as bad as discovering it after the ransomware attack killed everything!
Who has access? Check on perms periodically.
What is your version/service pack/CU compliance? If you aren’t up to date, and something bad happens that a CU or SP from 2 or 3 years ago could have fixed, you’re kind of responsible, right?
Is Your HA/DR Solution Helping or Hurting? HA/DR is cool! A failover cluster instance or availability group is a neat thing to build. High availability is great! But if you are having unresolved errors in cluster manager – your solution could actually leave you down completely.
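
As a rough illustration of the free space question above ("When will you be in danger?"), here is a minimal Python sketch. It assumes you already collect one free-space reading per day somehow; the function name days_until_full and the sample numbers are hypothetical, and a straight-line fit is only one simple way to look at the trend, not a recommendation over your monitoring tool.

```python
from datetime import date, timedelta
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[date, float]]) -> Optional[float]:
    """Estimate days until free space reaches zero, from (day, free_gb) samples.

    Fits a least-squares straight line through the samples and extends it
    forward. Returns None when free space is flat or growing, because there
    is no projected fill date in that case.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")

    # Convert dates to "days since the first sample" so we can fit a line.
    first_day = samples[0][0]
    xs = [(day - first_day).days for day, _ in samples]
    ys = [free_gb for _, free_gb in samples]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span more than one day")

    # Slope is GB of free space gained (positive) or lost (negative) per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope >= 0:
        return None

    # Day offset where the fitted line crosses zero free space,
    # measured from the most recent sample.
    intercept = mean_y - slope * mean_x
    day_at_zero = -intercept / slope
    return max(0.0, day_at_zero - xs[-1])


if __name__ == "__main__":
    # Hypothetical history: 100 GB free, shrinking by 2 GB per day for 8 days.
    start = date(2020, 4, 1)
    history = [(start + timedelta(days=i), 100.0 - 2.0 * i) for i in range(8)]
    remaining = days_until_full(history)
    if remaining is None:
        print("Free space is not shrinking.")
    else:
        print(f"Roughly {remaining:.0f} days until the drive is full.")
```

Real growth is rarely this linear (month-end loads, index rebuilds, and log growth all bend the line), so treat the number as an early warning to go look at the beam, not as a deadline.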

I could go on. I won’t. There’s more we can and should be doing. These things aren’t in your calendar. They aren’t in your new deployment meetings. They aren’t in the fun and exciting things. But we need to be vigilant. Entropy is real, and even your SQL Server deployment will trend towards an entropic state if left alone long enough. I’ll write more in a list, and maybe a proposed schedule like your owner’s manual, with a suggested calendar. Stay tuned here next week. I’ll have it by Friday 4/24. And now I wrote the date down, so I have to. Add any must-haves from your own list in the comments so I can shamelessly steal from you and add to the checklist. If it’s one that wouldn’t have been there otherwise, I’ll give you credit, too.

Article by Mike Walsh

Mike loves mentoring clients on the right Systems or High Availability architectures because he enjoys those lightbulb moments and loves watching the right design and setup come together for a client. He started Straight Path in 2010 when he decided that after over a decade working with SQL Server in various roles, it was time to try and take his experience, passion, and knowledge to help clients of all shapes and sizes. Mike is a husband, father to four great children, and a Christian. He’s a volunteer Firefighter and EMT in his small town in New Hampshire, and when he isn’t playing with his family, solving SQL Server issues, or talking shop, it seems like he has plenty to do with his family running a small farm in NH raising Beef Cattle, Chickens, Pigs, Sheep, Goats, Honeybees and who knows what other animals have been added!
