After performing hundreds of SQL Server Health Checks over the years, we wanted to share the most common repeat offenders we find time and again. These are:
- Missing Backups
- Missing Integrity Checks
- CPU in Balanced Mode
- Max Memory Not Set
- No Alerting Configured
- Not Patching or Upgrading
- Server Config Items
This topic is part of our monthly SQL Server webinar series. Webinar registrants get access to the entire recording, which includes the office hours Q&A. If you would like to be notified about upcoming webinars, be sure to subscribe below:
Sign Up for Updates
Introduction
All right, it’s noon and we will get started. People will probably join us as we go. But this session is being recorded. So hello and welcome to the SQL Server Health Checks The Worst Offenders Session. So this is a mike Walsh from Straight Path. We’ll talk about us in a second and what we hope to to see. So make sure I actually change my slide right.
So who is straight pass solution solutions? Well, we’re a SQL Server consultancy, and we do two main things. We do managed services and we call it a senior DBA as a service, a a lot of clients use us to sort of either be their DB team or to sort of work alongside their DB team and help them out as sort of a force multiplier, you know, so the DB can actually take a vacation which, which is rare in the DB world, straight path never takes vacations, although although our team members do from time to time and then we do a lot of bucket of hours type consulting as well, where somebody can buy a bucket of 20 or 40 hours or whatever they need and we’ll do tuning help make things better, fix problems, deal with corruption, development, help whatever they need. Really, if it’s a SQL server, a DBA or or DBA aligned task on the team, we have a lot of different SQL Server gurus with different specialties. And actually we’re joined by quite a few of them here, not the full team, but we’re joined by, I have to count one, two, three, four, five, six, six, six, including me.
Six of my colleagues here at St Path are joining for the Q&A which will follow. We do a lot of SQL Server health checks, which means we see a lot of things that aren’t great. And we’re going to talk about those things today and we’re going to focus in on seven things that we just want to stop seeing.
And we’re going to give you the tips so you can prevent these things yourself. Who am I while I’m the owner straight past, I’m a founder of this company and I’m a father, a husband and a friend. I’ve been I’m old ish, I guess I used to have a full head of hair before start working with SQL Server.
It went away someplace. I used to have reddish beard, now it’s whitish. I’ve been doing SQL Server work since version 65 for about 25 years now, and I’ve done a lot of health checks myself, and I’ve been blogging since about 2008 and speaking at SQL Server events. And if you have a SQL Server Health Check question, obviously we’re a company, you can buy services from us, but I don’t care if you buy services from us today.
I just want to help you fix these seven things and prevent them. So if you sort of download the links, which if you look at the, the chat I put, I put a link there which actually has a blog post that I put out yesterday. And our blog is the most recent blog posts and it will have all the links.
We’re going to talk about all the links I already have in the slides are there so you don’t worry about writing things down or trying to screenshot the slides. We’ll go there as well and in a few days we’ll get the recording out here from this session there as well, and then we’ll have a Q&A section to follow.
But if you have questions or if you start doing a sequence or a health check on your own and you want to ask a question some, you know, emails. Mike PESCA dot com just let me know he came to the session and I’m happy to answer your questions and make sure you head off in the right direction. I’m not as I am the sales guy here, but I’m not a sales guy.
I want to help you with your with your SQL Server problems, whether you give us business or not and make sure we’re not sliding here. All right. So about the format is if some of us are on the strip pass, I just want let me know in my internal chat if you guys see about the format, just make sure the slides are advancing right.
That’s should be what you see to do too. It is all right and cool. And if you have questions, please go ahead and put them in the Q&A section throughout this. And our team will be reviewing these questions and we’ll have some time to ask the questions, answer the questions at the end of the session as well. So we have an hour.
The first 30 or so minutes are going to be this slide. We’re going to talk about the SQL Server health checks we do and the things we see. And then and then at the end of this, we’re going to have an office hour style conversation where we’ll talk about questions you brought up in the Q&A section or comments we have from health checks we’ve done.
We won’t out any of our customers for for problems with that that they have specifically. But again, we see a lot of the same things at big companies, small companies, banks, hospitals, small mom and pop shops. We see the same kind of problems from time to time. You could say. So what is a superstar health check? Well, my introduction was way back when some folks here might remember that that term, a SQL ranger, I think is what they call them.
That was basically the the sort of fees the premier field engineers who would come out on site and come and perform health checks or they call them raps or SQL Raps, rapid assessment, something program. Maybe they come in, they tell you what’s good, what’s bad, and then tell you what to do about it. So when we do a health check at Straight Path, we we kind of glance on the what’s good.
We’d like to give some praise, but you’re not paying us to give false assurance and and hope you’re paying us to fix, fix problems. So we want to point out the things that are bad. So we really zero in on a health check on what’s bad, how to fix what’s bad, and then prioritize the order of fixes for you.
Because a lot of times in health check, there’s many things wrong. Some of them are kind of important. Some of them are less important. And we want you to know which ones are which. So why should you do a skills or a health check? Well, pretty simple. You know, it’s best to be proactive and preventative and do some planning.
I like peas, I guess. And we want to keep your environment well tuned and running. You know, this is things that my doctor, my dentist, my mechanic, all the small engines that we use at my farm would actually love the reality of what we do with schools for health checks is people call us up, most of our customers call us up and say, help, something hurts.
Very rarely do we get a customer that calls up to say, Hey, our SQL Server is running. Okay, can you tell us what’s good and what’s bad? So that’s kind of how I treat my my teeth, my, my body, my car’s in the small engines. I kind of wait till something goes wrong, but in order to know what’s what’s broken or how to fix, you have to know what’s broken first.
And that’s what the health check does. So hopefully you’re here not with specific problems, but you just want to know how your SQL Server is. And that’s what this is for. You can do your own SQL Server Health check. Yes, we do services. Yes, we sell services. But on the links and again, all the links from these slides, just to reiterate, are going to be I’ll put into everyone I see that somebody shared it, but it was shared internally.
It just went out to everybody. There’s a blog post I have my blog from yesterday which actually have all the links in here. So you don’t have to worry about clicking these links, but you can do your own SQL Server Health check. And I have a lot of content on my blog about how to do it, and there’s lots of tools out there from the intranet.
Brent Ozar’s tools. Glenn Barry Scripts are great. There’s there’s things and DBA tools, PowerShell module that you can use. There are many, many ways to do these. The trick is to do it and to start focusing on something. So check out that link in the list of links and start looking at doing your own SQL Server Health check.
So before we get into it, I just want to have a quick admission. If you ever had a home inspection done when you’re selling the home, the home specters find sort of a few kinds of problems, and we do too. There’s actual problems like these, like your roof is falling apart and it has six months of life left.
That’s kind of a big deal. That’s a supercritical and going to make it. You’re getting different fix. That would be something like you’re not doing back ups, for instance, which you may laugh, but we see all the time important. So this is like, Hey, your electrical service is 100 amp and it should be 200 amp because your breakers are stretched out.
It’s important and it’s hurting a little bit. It could in the future for SQL Server, that’s going to be maybe your CPU is running in balance mode. There’s things we find that maybe aren’t that big of a deal at all. But I put this in the middle because the auditors care about it. That would be like, Hey, your SQL Server is listening in Port 1433.
I don’t really care about that because the people who are looking for the SQL service in your network, they don’t care report. They’re going to scan all the ports anyway, but some auditor someplace saying you have to do it. So we have to sometimes do things to make the auditors happy. And I know a few of our customers are here and I know a few of you are in industries where the auditors sort of tell you what to do and they kind of run the show.
There are some things are kind of a big deal, but not really a big deal. Like I can’t think of a good home inspection one, but, you know, maybe you don’t have a proper drip edge someplace on a gutter or something. I don’t know. It’s it’s it’s there. And we should look at it. And then there’s something that that’s something we should talk about planning like, hey, why are you still on SQL Server 2016?
And so the point though is there’s a lot of things that are subjective. The seven things you can talk about today are not subjective. These are things that if these are not good, your environment is not good. So what are they? Well, rather than list them here, this is the list we’re going to talk about today and the list we talk about on that blog post.
But missing backups, integrity checks, CPU’s running the balance, motor your servers running in balance power mode. Max memory is not set. Nobody’s watching your SQL Server. That happens all the time. You’re not patching, you’re not upgrading. You have bad server config or VM configure cluster config and we’ll talk about that as well. So let’s get into it.
1. Missing Backups
Remember, these are the findings that we want to stop seeing and the findings that you can actually make a difference more. Right now, I would be perfectly happy if we didn’t make money ever again because some we didn’t have backups because those are really hard conversations. I’ve actually lost track the number of times we’ve seen this and big companies and small companies really important companies are really important data as well.
I won’t name the industries, but you name the vertical and we’ve probably seen them missing backups. It’s actually made me fearful as a consumer and a person who does banking and gets gets checked up from time to time from the doctors, just surprising the folks who don’t have good backups. But it’s not just missing backups. It’s also not understanding backups.
We see a lot of people that wind up with what they think are backups because they say, oh, don’t worry, the scan team is doing our snapshots or the VM team is doing our snapshots, but they’re not actually taking skill aware snapshots. And then when the the bad moment, when the bad stuff hits the fan, they can’t recover because they have what’s they have a backup that doesn’t actually restore they restore the snapshot and that up with a corrupt database because the and the FS and LDF stored data files and log files are not lined up.
So that’s kind of kind of a bad thing and we astray path believe you have to have what we call a corruption resistant backup strategy. And again, all the links, just to reiterate, are on that post, but really a corruption resistant backup strategy we’ll talk about in the next slide when we talk about not doing check DBS, but if you have databases in full recovery model, which means you can recover to a point in time and you take full backups, we like differentials, too.
So maybe a weekly fall in daily devices. Our typical backup strategy that may or may not work for you depending on your data and your server needs and then log backups and keeping all of those backups between good check. DB So not you don’t want to remove your backups until you know you have a good healthy check. B If you do that, you can recover from corruption with minimal and potentially no data loss.
Really no data loss. You might have some downtime. We do restores, but you’re going to be in a much better world. We’ll talk about corruption in a moment. Unfortunately, we we see that a lot of you attending this may never see corruption in your SQL Server career. It’s very possible to not see corruption if you work in one place for your whole career.
And yeah, pretty simple environment and your your storage is perfect and your machines are perfect. You never have power failures and no bugs from SQL Server windows are your storage drivers and you’re not using this VM tool and this storage tool and this snapshot tool and all these components which can lead to corruption. We see a lot because we’re consultants and it’s never good when we see it.
Sometimes our answer to customers is cool. We can fix it. Sometimes it’s How is your backups? And based on their answer, our answer is sometimes while you’re going to lose some data, it’s just a question of how much, which brings us to integrity.
2. Missing Integrity Checks
So we want to always be checking the database integrity. You know, as frequently as we can, we tend to do weekly, you know, to go back a slide to frequency really determine is determined by your backup, how many what you retain for backups.
I’m trying to say your backup retention. So if you retain backups for a full week, all your backups including are log backups between good check beast and a weekly cadence is fine if you don’t have the space for that. I would fight hard to get the space for that and say, Well, why don’t you have the space for your backups?
It’s really important. But but but you, you want to make sure you’re doing these checks. So we do them weekly for our clients, probably 95% of the time. So disk check TB is a process that will actually look at every single page of data, every eight kilobyte page of data in your SQL Server database. And we’ll ask it, are you corrupt?
Are all of your bits where they’re supposed to be? And if they are, you get an error message in your check. DB And you’re in to restore land if you are doing a corruption resistant backup strategy. And there’s three ways you’re going to find corruption, one of them is doing a check. DB And that’s how we prefer it when we ideally we’d never find it, but we prefer to find it doing a, B, you might catch on a backup if you’re doing backup with checksum, which we recommend.
And in the links that you’ll see on the posts on our blog, you’ll see information about doing backups and we do recommend backup a checksum. You might find a corruption in the backup process. The checksum fails. Not always though. And then the time when we see corruption. Most often when people call us up and say, Hey, can you guys consult for us?
It’s not typically with number one because a lot of clients aren’t doing check to be. It’s not because number two, because logs on chart using checksum, it’s often number three where a user’s running something and they get this error message and it’s not not a good time. So do check DBS that that’s the thing we want you to do.
Again, we’re happy to tell you. But on the health check that we do. But I would rather you find the corruption first and you do the check. TB So we can actually help you with the, the more advanced and, you know, fun things to do.
3. CPU In Balanced Mode
Performance; I just got done doing a health check for a really, really big company the other day and they’re having some performance problems and wouldn’t you know it their SQL server is CPU is running imbalanced load so so their power config for their SQL server is set to balance mode, which means I want to save some, some kilowatt hours per year or per decade.
And the way it works is, is basically Windows says, we’re going to we’re going to downgrade. We’re in a down level, your CPU performance. And if we see you, you needing it and you start to use it and your performance peaks and it peaks for a little bit of time like a second or maybe it’s milliseconds, but if it peaks up, then will jump your, your, your CPU throughput up.
We’ll actually use what your what you paid for your CPU. The problem is your SQL servers and we’ve seen this sometimes, but your SQL Server should not just be pegged at 100%. It’s going to be up and down very spiky. So quite often what happens is SQL Server gets busy and then it starts going off the cliff and it’s not busy anymore.
And now the CPU says, Here you go, here’s all the gigahertz you needed so I don’t need anymore. Thank you. So we’ve seen clients improve 10%, 15%, even even 20% by just clicking control panel system, power configs or typing in power cfg. That’s all. I type it in my sleep and going from balanced mode to high performance mode, it’s free, it’s easy.
It doesn’t need to restart. It doesn’t require any approvals as far as I’m concerned. I mean, if you’re working for a big bank, you probably have to get approval in six months of paperwork before you can do it. But it’s a free setting with no downside. So fix that before we do a health check for you. Here’s another one I just saw yesterday for that same customer.
4. Max Memory Not Set
They didn’t have their max memory set. Anybody here? It depends on your age. You might remember Ron Popeil and his infomercial. Somebody here may not even know what infomercial is, but around Popeil was this guy who had infomercials for all sorts of things. One of them was the rotisserie chicken. And in his rotisserie chicken infomercial, he talked about how it was and the whole audience would shout it back, set it and forget it.
And SQL Server, a lot of people think is set it and forget it because you can put as they put the disk and I’m showing my age now if I’m not already showing my age of my bald head, of my beard. But you download the ISO and run the install and you click next, next, next finish and you have a SQL server and then you install your QuickBooks application or whatever application one install and poof you have a production SQL Server database running on your SQL Server that nobody did any best practices on.
One of those things that comes out of the box is Mac’s memory. It’s gotten better and later versus people you can actually it gives you some some adjustments for these things. I’m pretty sure actually my team has to go and speak at everyone. I’m you can tell I’m a CEO now. I have not installed SQL Server 20. I don’t think I can install SQL 2019 myself.
I know what ask you about Mac. Stop. I know it fixes temp TB does the installer. So we just make sure you and everyone from the panelists does the installer. Now even ask us about Mac’s memory? I think it may, but you have to still know what you’re doing and click it. It does, Davis told me. It does so.
So it does, which is good. But you can still end up with it set to unlimited or to two terabytes, which is effectively unlimited for most customers. So we have customers and more memory than that and servers, but it’s unlimited not a good out of the box default e to go in there and check that again that’s a free setting doesn’t fixing it doesn’t cost you anything to do it do note though that if you make a change to to this, this will have to run a reconfigured behind the scene.
So if you change your Mac screw parallelism cost restore parallelism maximum re optimizer out of workloads certain settings you change in SQL Server runs are reconfigured behind the scenes and these settings will dump things that are in memory. So depending on your memory settings, I don’t know if this will always dump things, but if you are not using all of your memory, it may not.
But if you are using all of your memory, you’re going have to take something out of memory. So you’re going to be potentially forcing plans and data out of cache. So this will be a little sluggish performance after you make the setting, but it’s still a free setting that you should do.
5. No Alerting Configured
The next thing you want to stop seeing are unwatched SQL servers.
No, no. The whole thing we do with our DB as a service is we watch your SQL servers and we bring a couple of tools in which do daily health checks and an alerting. But so many SQL service have nothing like they’re just sitting there and nobody’s watching it unless somebody is complaining. We see jobs in the ceiling for for for eons.
So maybe you are doing backups, which is great. Good for you. But you’re backup jobs and failing for the past three months we’ve had clients like that before and nobody knows because they’re not getting the emails and it’s relatively easy to set up a single agent email or to have an email when a job fails, it’s relatively easy to set up SQL Server alerts to at least know if your SQL Server is throwing Severity 17 or, God forbid, severity 25 errors on a daily basis, or sort of giving you warning signs of corruption 820 threes and 24 errors.
So you should be setting up at the very minimum SQL Server agent alerts. You should be setting up emails for jobs that really matter. And when I was a production DBA way, way back when, in the days before my, my, my employers would spring for monitoring tools or DB as a service companies to help me out. I would set up one email from my I did daily because I didn’t know the benefits of differentials back then, but I would set up a daily backup job to send me an email either way.
So every day, every morning I would have an email saying, Hey, your backup succeeded. So at least I know I’m getting alerts and if I didn’t see an email, I would know, hey, the lack of an email about a failure could just be that. My, my, my email program’s not working or my my email is not working. So I set up alerts.
I’m sick and tired of seeing that too.
6. Not Patching or Upgrading
Patching and upgrading, especially nowadays with ransomware, with with viruses, with zero days, with all of the things being fixed in SQL Server because the surface area of SQL Server has increased and everybody uses it differently and there are bugs, there are things missed and. Q A It happens even with Microsoft.
I know that’s shocking to all of you. If you are using Internet Explorer, you should be patching your SQL servers. So we’ll talk about upgrades separate in a second here in the same slide. But we often see I mean, we just had a client that was SQL Server 2008, I think our two but our team like a few months ago.
So so many things have happened between patches. If you don’t apply them, we prefer you to stay current but not super current. We do not want you to just download every SKU the moment it comes out. The one exception might be if it’s a really critical security update, and Microsoft says, you better do this because a zero day out there, you really have to do it.
If you have a test environment shopping, you do do with their first and wait and see but we prefer you to wait maybe maybe one patching cycle. So maybe don’t be in the latest. See you up in the latest Q minus one so the last see you so and see you 14 comes out deploy so you 13 now otherwise a 13 that’s an unlucky number let’s say 15 and 14 deploy deploy the one before because if you’ve been paying attention to the SQL Server community and blogs for a while, you’ll see that I can probably think of 4 to 6, maybe even more.
And people might, might even count more and type in the everyone chat. But there’s been at least 4 to 6 kind of not small oops as from Microsoft where they released a patch and they said, wait, wait, wait, don’t deploy that patch, wait a second, deploy this one instead. And if you did deploy that patch, uninstall it or deploy this fix to fix the patch.
So I like to just wait and let somebody else’s customers experiment with the brand new patches. And again, if you get into a ransomware situation or auditing situation, this is the kind of thing they’re going to ask you. So it’s like the port number that we talked about, a few slides ago. We all in this call know that change in the scope of report number doesn’t really do anything to stop anything.
But the auditors might make you do it because something they have to do and then you have to defend it. And let’s say you have a bad situation or somebody hacked your system, you have to not answer in a courtroom. Why didn’t you do this best practice? Well, same thing with patching. If you have an issue and you’re not patched, that’s sort of your liability.
It’s your company’s liability, but it’s also your liability as far as your job goes and as far as all that go. So so you should be patching.
7. Server Config Items
And the last one that we want to stop seeing are really misconfigured VMs, misconfigured servers and misconfigured clusters. In the office hours. I’m going to ask Sandra to talk about some cluster oops is that we’ve seen even recently we saw our customers recently have a fail over because they never had a file or a witness record.
We often see VMs not set properly. I’ve lost track now of the number of times I’ve seen SQL Server Standard with 12 sockets, one core per socket and eight of the sockets not available. Therefore they’re only using four of their 12 CPU’s. They’re paid for because they didn’t realize SQL Server standards limitation and because they just click next in their VM console and nobody, nobody check these things.
So I’m not saying you’re all bad if you make these mistakes. It’s just these are the kind of settings and configs, these seven that we’ve talked about, including this one that that we see all the time and have lots and lots and lots and lots and lots and lots and lots of documentation. Blog posts entire sessions at pass dedicated talking about this and other SQL Server conferences, SQL SATs and user groups.
So these seven things are really well advertised things that you should not have an environment and we don’t mind finding them in the health checks and fixing them with you. We just would rather use your time to fix the things that are not so easy. The things are a little bit more complex to find and this is one of them.
So server config, VM config, cluster config, it can go deep, quick. So we’ll find misconfigured cluster sometimes when it comes to timeout settings and other other things. But those are not what we’re talking about. We’re talking about really basic things, like you built a cluster without a core or file share witness like the course. Those things really should, shouldn’t exist anymore.
It’s so that that’s it. Those are those are the seven things we want to stop seeing on our blog post. I put in the chat a couple of times. We have links to all of these things we talked about and in the coming days we’ll actually have this recording there. And then based on we’re going to talk about in the queue at any moment, we’ll add more links.
So we have lots of links about all of these things. I’m one of the people who’s been blogging about all of these things for a long time, and I didn’t mention so many things like this. Even this sort of running on list here at the bottom is a small list, you know, DB compatibility levels in your databases in proper file groups found that can matter because it can really hurt performance instant file initialization disabled.
I found that yesterday and the same customers talking about the had max memory set the default which is unlimited effectively they had bad CTF PHP, they had clusters were parallelism. Their cpu’s are in balance mode and they have no no non clustered indexes, which is a different problem that’s more advanced. It’s not that advanced, but it’s more deep because you have to know what indexes and but improper trace flags are data and log files and same drive errors that have been there forever that nobody knew about stack dumps quietly happening behind the scenes that don’t quite restart sequel.
But your SQL Server is not healthy auto shrink are we see that still shrink jobs running auto closed people who think that and my relativity clients will know this people who are told by their software vendors, oh, rebuild their indexes every day will help. It doesn’t help. So these are things we we see on health checked as well and if we had 5 hours I give you 60 things and talk about all of those some we talk about quite a few of these are on our on our blog as well.
And we actually have a post coming up next week from from David C set our team where he sort of takes what we have is a basic health check checklist and he really expands on it with more things. So he talks about some of these things as well. And again, take take our advice, take our wisdom. We’ve learned the hard way.
Everybody here and the panelists that which we’ll talk about in a second. We all have learned SQL Server from the School of Hard Knocks. So rather than try and learn it from the school hard, knock yourself, go ahead and fix that. And by the way, this slide is not February 2019. I like that. I just noticed that. So that’s you can tell I copy from an old slide show, but this is not presentation worse practices.
So that’s that’s the main part of the session if you have questions deep into the Q&A and we’re going to switch over to Q&A mode in a second, but we’re going to just jump to office hours. I’m going to have the panelists turn the cameras on and stop sharing my screen and we’ll just sort of discuss what we use, what we talked about here.
Thank you again for attending next month. You’ll see on the blog post today and I’ll put that post up once more. Right now we have a registration link for next month’s topic, which which is about ransomware and security and kind of asking the question of where does it fit here? And in my mind, the answer is kind of a twofold answer.
If if the DBA is worrying and if they’re the first person at your company worrying about ransomware and virus protection and security, there’s probably something wrong your organization with the security, right? Because somebody else would be worried about that first. But I also, you know, this book here, it’s always on my bookshelf is, you know, extreme ownership, but it’s a book we read here straight past and the book was there.
I didn’t I just happened to be supposed to get it. But it’s it’s everybody’s responsibility, right? If everybody said it’s not my responsibility, then I mean, just nobody responsibility and it gets missed. So there are things that we can do as as computer users, as email users, as, as, as it’s citizens to help prevent and lower the surface area for attacks.
Also, there are things as a DB we can do and we can lobby hard for to try and make sure that in the event that our endpoint protection and our perimeter gets breached and anti ransomware does get into our environment, that at least we can get back quickly. So things like run books, things like offsite backups, offsite, offline, off network backups, we see many people ransomware attacks have their backups destroyed even though there different data center, but they’re on the same network.