Alerting on Bad Parameter Sniffing Using Automatic Tuning

Alerting on Bad Parameter Sniffing Using Automatic Tuning

IMPORTANT NOTE: This requires SQL Server 2017 or newer, Query Store and Enterprise/Developer edition.

In SQL Server 2016, we were given the Query Store feature, and all the perf tuners rejoiced. Finally we had a great tool to help us with parameter sniffing issues. These days, any time I work on a parameter sniffing issue on SQL Server 2014 or earlier, I get mad that they aren’t using 2016 or newer, not because I’m lazy and just want to force a good plan but because Query Store can easily identify the problem query and I can dig into the good vs bad plans. I often will fix the problem by adding a covering index, but sometimes I force a good plan. On SQL Server 2014 or earlier, if I’m not fixing the problem with a new index, I might have to add a table hint or a query hint. If I can’t change the code to add those hints, then I might have to use a plan guide. GAH!! Been there, done that. 0 out of 10, would not recommend!

In SQL Server 2017, Microsoft enhanced the Query Store feature by adding Automatic Tuning which is only available in Enterprise and Developer editions. To get the collection of the Automatic Tuning data, you must have Query Store enabled on the database.

There is a wealth of information in this Automatic Tuning data, so much so that I wondered if a tool could be developed to proactively find and fix problems. You might be thinking that I must not know about the FORCE_LAST_GOOD_PLAN option, which allows SQL to force plans when a regression is detected. A client enabled that option and suffered through 4 hours of severe performance degradation when SQL forced a really bad plan. It took 4 hours before SQL reverted it. We were contacted after the event and promptly disabled the option since it caused such a severe performance issue.

What if we could leverage the Automatic Tuning data to find a performance problem without the risk of SQL forcing the wrong plan?

Utilizing the Automatic Tuning Data

After that incident, I dug into the Automatic Tuning data and decided to implement a job to detect bad parameter sniffing. The basis of the tool is the query from here.

Using that query, I added filters to catch the bigger issues as you don’t want to “fix” all of the rows as it’s taking away SQL’s ability to choose which plan is best. SQL picks the best plan typically, but sometimes it doesn’t.

I wanted to proactively catch regressed plans that are causing a high CPU event or that are noticeably performing poorly. I also wanted to catch the performance issue before our client’s customers reported an issue. These are the filters I decided on for one particular system. Another system might need other filters, but this is a good starting point.

SELECT *
FROM (
	SELECT planForceDetails.query_id, 
		(planForceDetails.regressedPlanExecutionCount + planForceDetails.recommendedPlanExecutionCount)
					  * (planForceDetails.regressedPlanCpuTimeAverage - planForceDetails.recommendedPlanCpuTimeAverage)/1000000 as estimated_gain,
		TR.reason, TR.score, JSON_VALUE(details, '$.implementationDetails.script') as scipt,
		planForceDetails.regressedPlanId, planForceDetails.recommendedPlanId,
		planForceDetails.regressedPlanCpuTimeAverage, planForceDetails.recommendedPlanCpuTimeAverage, 
		planForceDetails.regressedPlanExecutionCount, planForceDetails.recommendedPlanExecutionCount
	FROM sys.dm_db_tuning_recommendations AS TR
	CROSS APPLY OPENJSON (Details, '$.planForceDetails')
		WITH (  [query_id] int '$.queryId',
				regressedPlanId int '$.regressedPlanId',
				recommendedPlanId int '$.recommendedPlanId',
				regressedPlanErrorCount int,
				recommendedPlanErrorCount int,
				regressedPlanExecutionCount int,
				regressedPlanCpuTimeAverage float,
				recommendedPlanExecutionCount int,
				recommendedPlanCpuTimeAverage float
				) AS planForceDetails 
	LEFT JOIN sys.query_store_query AS Q ON planForceDetails.query_id = Q.query_id 
	JOIN sys.query_store_query_text as QT ON Q.query_text_id = QT.query_text_id 
	LEFT JOIN sys.query_store_plan AS regressedQP ON planForceDetails.regressedPlanId =  regressedQP.plan_id
	LEFT JOIN sys.query_store_plan AS recQP ON planForceDetails.recommendedPlanId =  recQP.plan_id
	WHERE JSON_VALUE(state, '$.currentValue') = 'Active' 
		AND planForceDetails.regressedPlanCpuTimeAverage/1000.0 >= 500 --bad plan uses at least 500 milliseconds for CPU time
		AND planForceDetails.regressedPlanCpuTimeAverage/planForceDetails.recommendedPlanCpuTimeAverage >= 10 --bad plan is at least 10 times slower than the good plan
		AND planForceDetails.recommendedPlanExecutionCount >= 500 --good plan has at least 500 executions
		AND planForceDetails.regressedPlanExecutionCount >= 200 --bad plan has at least 200 executions
		AND TR.reason <> 'Number of errors in the regressed plan is greater than in the recommended plan'
) t
WHERE estimated_gain >= 1000
ORDER BY estimated_gain desc, score desc;

Here’s a good example of a couple of queries that have regressed and should be looked at:

Example of queries that have regressed and need attention

Average CPU time over the last 2 days for query_id = 109259463:

Average CPU time for query_id = 109259463

Clearly plan_id = 118594051 is not a good choice for this query as its CPU time is averaging much higher than the other two plans.

Let’s Alert on This!

You can have the tool force the recommended plan or alert on it. Because of the severe performance issue that a client encountered due to the FORCE_LAST_GOOD_PLAN option, we chose to alert when it returns any data so that a human can make a decision. The monitoring tool that we use at all of our clients isn’t flexible enough to create our own alerts, but it will alert us when a job fails. I can easily fail a job using RAISERROR with severity 16.

/*
Get sp_ineachdb here: https://raw.githubusercontent.com/BrentOzarULTD/SQL-Server-First-Responder-Kit/dev/sp_ineachdb.sql

CREATE TABLE AutomaticTuning (created_date datetime NOT NULL, database_name sysname NOT NULL, query_id int NOT NULL, estimated_gain decimal (18,2), reason nvarchar(4000), score int, script nvarchar(4000), 
	regressed_plan_id int, recommended_plan_id int, regressed_plan_cpu_time_average float, recommended_plan_cpu_time_average float, regressed_plan_execution_count int, recommended_plan_execution_count int);

ALTER TABLE AutomaticTuning ADD CONSTRAINT PK_AutomaticTUning PRIMARY KEY CLUSTERED (created_date, database_name, query_id);
*/

SET NOCOUNT ON;

DECLARE @current_date datetime = GETDATE(), @cmd nvarchar(max), @query nvarchar(1000), @database sysname;

SET @database = 'DB_Administration'
SET @cmd = '

INSERT INTO ' + @database + '.dbo.AutomaticTuning
SELECT *
FROM (
	SELECT ''' + CONVERT(varchar(25), @current_date, 121) + ''' AS created_date, DB_NAME() AS database_name, planForceDetails.query_id,  
		(planForceDetails.regressedPlanExecutionCount + planForceDetails.recommendedPlanExecutionCount)
					  * (planForceDetails.regressedPlanCpuTimeAverage - planForceDetails.recommendedPlanCpuTimeAverage)/1000000 as estimated_gain,
		TR.reason, TR.score, JSON_VALUE(details, ''$.implementationDetails.script'') as scipt,
		planForceDetails.regressedPlanId, planForceDetails.recommendedPlanId,
		planForceDetails.regressedPlanCpuTimeAverage, planForceDetails.recommendedPlanCpuTimeAverage, 
		planForceDetails.regressedPlanExecutionCount, planForceDetails.recommendedPlanExecutionCount
	FROM sys.dm_db_tuning_recommendations AS TR
	CROSS APPLY OPENJSON (Details, ''$.planForceDetails'')
		WITH (  [query_id] int ''$.queryId'',
				regressedPlanId int ''$.regressedPlanId'',
				recommendedPlanId int ''$.recommendedPlanId'',
				regressedPlanErrorCount int,
				recommendedPlanErrorCount int,
				regressedPlanExecutionCount int,
				regressedPlanCpuTimeAverage float,
				recommendedPlanExecutionCount int,
				recommendedPlanCpuTimeAverage float
				) AS planForceDetails 
	LEFT JOIN sys.query_store_query AS Q ON planForceDetails.query_id = Q.query_id 
	JOIN sys.query_store_query_text as QT ON Q.query_text_id = QT.query_text_id 
	LEFT JOIN sys.query_store_plan AS regressedQP ON planForceDetails.regressedPlanId =  regressedQP.plan_id
	LEFT JOIN sys.query_store_plan AS recQP ON planForceDetails.recommendedPlanId =  recQP.plan_id
	WHERE JSON_VALUE(state, ''$.currentValue'') = ''Active'' 
		AND planForceDetails.regressedPlanCpuTimeAverage/1000.0 >= 500 --bad plan uses at least 500 milliseconds for CPU time
		AND planForceDetails.regressedPlanCpuTimeAverage/planForceDetails.recommendedPlanCpuTimeAverage >= 10 --bad plan is at least 10 times slower than the good plan
		AND planForceDetails.recommendedPlanExecutionCount >= 500 --good plan has at least 500 executions
		--AND planForceDetails.regressedPlanExecutionCount >= 200 --bad plan has at least 200 executions
		AND TR.reason <> ''Number of errors in the regressed plan is greater than in the recommended plan''
) t
WHERE estimated_gain >= 1000
ORDER BY estimated_gain desc, score desc;
';

EXEC sp_ineachdb @cmd, @user_only = 1, @exclude_list = @database;

IF EXISTS (SELECT * FROM dbo.AutomaticTuning WHERE created_date = @current_date)
BEGIN
	SET @query = 'SELECT * FROM ' + @database + '.dbo.AutomaticTuning WHERE created_date = ''' + CONVERT(varchar(25), @current_date, 121) + ''' ORDER BY database_name ASC, estimated_gain DESC';

	RAISERROR (N'Bad parameter sniffing detected. Please investigate by running this query: %s', 16, 1, @query);
END

If you want the tool to force the recommended plan, you just need to execute what’s in the script column in the output. I’ve left that step to the reader since it’s not what we implemented.

Setup the Prerequisites

Before you schedule this, you need to fix the rows returned by it so that it doesn’t cause the job to fail. The script should return zero rows before you schedule it.

You’ll also need to create sp_ineachdb and the AutomaticTuning table in your tools database.

CREATE TABLE AutomaticTuning (created_date datetime NOT NULL, database_name sysname NOT NULL, query_id int NOT NULL, estimated_gain decimal (18,2), reason nvarchar(4000), score int, script nvarchar(4000), 
	regressed_plan_id int, recommended_plan_id int, regressed_plan_cpu_time_average float, recommended_plan_cpu_time_average float, regressed_plan_execution_count int, recommended_plan_execution_count int);

ALTER TABLE AutomaticTuning ADD CONSTRAINT PK_AutomaticTUning PRIMARY KEY CLUSTERED (created_date, database_name, query_id);

Schedule it

Once you have the prerequisites in place, you can schedule it to run. We run it every 15 minutes which seems like a good balance between alert fatigue and finding problems before customers report them.

When the job fails because it found a regressed plan that needs attention, force a plan for it so that the job stops failing. Once a good plan is forced, you can take the time to come up with an even better solution, such as adding a covering index.

Subscribe for Updates

Name

3 thoughts on “Alerting on Bad Parameter Sniffing Using Automatic Tuning”

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Share This