So maybe you read the post title, and you already have an idea where this post is going… In the heat of the moment, I didn’t. Long story short – we had a client who was having some ugly excessive autogrowths causing a nasty cycle. Their autogrowth wouldn’t complete within the timeout period for the connection causing the growth. So the autogrowth wouldn’t finish and rollback. And there was much rinsing and repeating. We solved it. The team quickly identified the issue, realized the 50GB autogrowth was an issue, grew in smaller chunks, detected the activity that looked like an API call stuck in a loop that caused excessive growth, and we got out of the loop. But then we spent a lot of time troubleshooting that slow autogrowth. By “we” on the “quickly identified and solved the crisis,” – I mean the rest of our team at Straight Path. By “we” on the troubleshooting, I mostly mean me 😉
There are a couple of reminders for me in what we just did to help these great customers out. Maybe they’ll help you out.
Transparent Data Encryption (TDE) + Instant File Initialization = No Dice!
I know this. I’ve known this. I’ve told people about it when answering a forum question before. I’ve gone through the MSFT documentation before when screenshotting a setup for clients. I’ve included it in the write up for folks we’ve pushed to TDE. That was one of the significant issues for this client. They had recently implemented TDE within the past 6-9 months, and this was the first sort of big growth event during a perfect storm of what seemed like some faulty API calls stuck in a loop, causing the growth. Long story short – we changed the default autogrowth from 50GB to a number that finished well before the timeout period for the queries causing the increases.
A reminder for us – actually forget it – there are at least three here –
- White space isn’t evil. It’s okay to manage your DB files with some white space to “grow into.” Even with Instant File Initialization as an option, there are cases (like with TDE enabled!) where this isn’t sufficient.
- Test. Do some growths even with IFI enabled and see how long the growth takes. Maybe there is some other issue with AV or windows processes that can cause the growth to be slow.
- TDE + IFI = Bad. If you need encryption, know you don’t get the benefit of fast autogrowths. Normally this is an “okay” situation – but remember that a user transaction that needs to grow a database is waiting on that autogrowth to complete. So set your growth increments appropriately if you do this – so you don’t have your users waiting on an auto-growth.
Treat the Patient, Not the Monitor!!!
But wait! There’s more! Quite often, with a technical reminder like this, there is a professional development/life skills reminder that goes along with it for me. Well, “here’s your sign…” In my “outside of work” life, I’ve spent maybe a decade as an EMT/EMT-I and even Paramedic student in the local fire service in the small towns I’ve lived in. I’ve finally let my EMS licenses lapse over the past half-decade or more and am “barely” still on the Fire Department with my crazy schedule making me miss most of the calls… A lot of who I am as a technologist and business owner and person came from all of the training and work in that space, though. There are so.many.lessons for us technologists from the emergency services. Here, though, I fell victim to something that I’m embarrassed to admit! I treated the monitor… Not the patient. This server was on an Azure VM. It was an AG built out of Azure’s defaults some time ago – and the creator had put all of the AV extensions there. We noticed a lot of “extra activity” involving all of the various Azure AV and Antimalware “stuff.” It was high. And the autogrowths were taking a LONG time. So, I started going headlong into what the monitor was telling me. We started working on trying to fix the AV exclusions and tone down the autoscanning. We all know (hopefully we all know this!!) SQL Server likes certain exclusions on AV. So it made sense. The tooling was a bright beacon to it. At one point in my head, I thought, I should create a new DB, make a huge data file, and grow it and see it grow. I was ready to do that. I knew I wanted to because something was off here, but the task manager and resource monitor’s disk activity stole my eyes. After doing a lot of work on the AV/Malware/etc. (and finding a disk DEFRAG JOB RUNNING RANDOMLY!!!! – The client didn’t recall adding it, maybe something in the default setup here they did four years ago – a lot has changed in Azure over the past four years 😉 ) we still were seeing slow growths. I decided to just try that little experiment. And the.moment.the.new.db.grew. Quickly I said in my head, “SHOOT. TDE…” You see I knew that the new DB wasn’t TDE protected. New DBs aren’t by default. And the moment I could go 300GB with 0 seconds of growth time, I thought for about 3 seconds. “Wait. What’s different? Why was that one slow? The other DBs are in an AG, but that doesn’t matter here, why was this one fast, and it’s new. OH CRUD!! TDE!” and I had my answer. We solved it. We got there. And we saw some other issues that were causing slowdowns, but I’m also mad at myself for wasting about an hour of my time and the team’s time on that call for not going their first. I even told the client I wasn’t going to bill for an hour of everyone’s time because I should have sniffed that out sooner.
In the Emergency Medical Services world, there’s a concept of “Treating the patient, not the monitor.” In the back of an ambulance, we have lots of tools to tell us about a patient. We have a cardiac monitor that gives us a quick look at the heart’s rhythm, letting us know about dangerous situations. We have 12 lead EKGs to show us a more in-depth look at the heart rhythm and possibly show us signs that point out ischemia suggesting a heart attack in progress. We have blood glucose monitors. We have capnography to explain the chemistry involved in breathing. We don’t have as many tools as they have in the ED – but in emergency medicine, there are a TON of tools. These tools are amazing. They help get diagnoses faster. They lead to differential diagnoses, and they get treatment plans underway quickly and help providers realize if they are making a positive impact right away or not. These tools are ESSENTIAL to the fantastic numbers the trauma and emergency medicine profession have seen over the decades of long term survival after injury or serious medical emergency.
But the patient is still the center. A human still listens to the lung sounds to perceive a subtle difference. I always felt a pulse with my fingers, even though a few of our tools give us a pulse rate – you can’t tell if it is strong/weak/thready/bounding necessarily with the machines. You still have to take the patient’s history. You have to get them to state (in their own words, a few times) their chief complaint and why you are there today. Sometimes the patient shows you their problem way before the monitor does. Sometimes the patient can show themselves crossing from compensation to decompensation before their sudden blood pressure drop sucker punches your adrenaline into overdrive. And when you pick up on the cues from the patient, you often have a bit more time to do something about it.
It’s the same thing with SQL Server. I mean, not the same thing… We have great free tools like Query Store, DMVs, Perfmon, scripts, etc. We have third-party tools for monitoring. These tools are essential in our DBA toolkits. But we also need to add to those tools our judgment. Our wisdom. And all of the available data inside and outside of the monitoring tools.
A few thoughts for us Technologists here –
- Step Back. If you find yourself with tunnel vision on one chirping and blinking metric, it may be the one sign that will solve the issue and save the day. It may, however, just be a false alarm sending you in another direction. Step back and assess the whole server. Ask history. Understand why you are there and look at all the data, weighing the blinking/shining/noisy alerts in as just data points, not neon arrows to follow above all else.
- Work it out aloud – Ask your “partner” on the issue “what do you think of this? I see this, but this is happening, where should we look?” reason together and sort it out. Talk about the data. Evaluate and form a plan aloud, and you’ll likely shoot down your ideas and miss less.
- State and test your hypotheses – Had I called out that “we should create a DB quick and see what happens” – I bet someone would have done it even if I was stuck in monitor land looking at the blinking alarms pointing us in one direction. Had that happened, I probably would have questioned the blinking since it didn’t affect that DB, and I would have gone back to “basics” and said “OH CRUD! TDE!” an hour sooner.
- Don’t forget the basics – Most of the problems we encounter – in SQL and the Ambulance – are of a type and variety we’ve seen. Sure the minor details will be different. But there are only so many problems out there. What’s that saying, “History doesn’t repeat itself, it just rhymes”? I’m not saying there’s nothing new under the sun – that’s for a longer post on another medium – I’m just saying start simple. Rule out the simple things first. The things your own eyes and perception tells you. Use the monitor to test and go more in-depth, but most of our ambulance calls were not episodes of that TV Show, “House.” Most SQL Server issues are a fundamental problem or series of primary and unfortunate events cascaded. Find the basics and move on.
Remember – our “patient” is the SQL server. Not the monitoring tools. Use the tools to help us. But don’t get lost in them.