Here’s today’s issue; how do I monitor/capture information about query timeouts on my production server? This issue crops up from time to time and I never seem to have the time or energy to devote to it. We don’t have many problems with timeouts on our production servers, but when we do the DBA should really know about it sooner rather than later. What follows is my latest solution to this; I’ve developed an Extended Events Session to monitor timeouts and report (via email) when the threshold is reached.
What do I mean by “timeout”? Our applications use stored procedures to perform database access, so I am referring to timeout errors that are generated by the application when the timeout period has expired. By default, (using the providers that our application makes use of) this period is 30 seconds. Please note that this is NOT a SQL Server error; it’s a client-side error that occurs. So, the stored procedure executions are not generating the errors; the issue is that the application is cancelling the query before it completes.
Why do we get timeouts? There are several reasons, but what I am most concerned about in our production environment is when a stored procedure is compiled using “rarely” used parameters, resulting in a poorly performing plan for the majority of the calls to that procedure. This situation doesn’t happen very often (once every few months for us). When it does I like to review the procedure and have it re-written to avoid this situation (you can read many articles on parameter sniffing, etc. to get your fill on this topic). However, the immediate need in production is usually that 1) there are many users that cannot use a particular page in the application, and 2) I need to correct that as soon as possible. Many times the “quick fix” is to remove the offending stored procedures from cache, allowing the database engine to re-compile the proc and get a better performing plan for everyone. To do that I need to know what proc(s) are the issue at the time. I can go back after the emergency is over and review the code and re-write it to avoid this issue in the future.
In our shop there is monitoring in place from the application side that reviews all errors recorded by the application for timeout errors; it emails the web administrators to warn them. That’s all fine and dandy, however, the DBAs don’t get that notice and the errors they generate don’t contain the procedure names that were called (just the web pages). There is also database monitoring in place to capture all of the calls to the database. I recently converted this from Profiler Trace output to Extended Events (for SQL 2014) and it works great, but I capture this information into *.xed files and load summary information (the next day) into my DBA database for analyzing after-the fact. So this doesn’t really help me solve a “fix it now” production problem.
As fate would have it this issue happened again this week. Since I now have an Extended Events Session running I quickly clicked on “Watch Live Data” in SSMS for my session…ok…lots of data scrolling past…then it hit me; I can filter the data in the events for a duration greater than 30 seconds (actually I used 29 seconds), and EUREKA! I found my offending stored procedures! The events have the object name (i.e. procedure name) as part of the XML data, so I grabbed the names (I think there were a couple that day) and freed the stored procedure cache for those procs. Instantly my users were much happier.
So the emergency was over…but it took my web administrators 4 hours to notify me of a problem, and it took me another bit of time (it seemed like hours but was only minutes) to find the stored procedure names that were the problem and fix the issue.
After putting some thought into it and reading some other articles (thanks Grant Fritchey and the folks at Brent Ozar for inspiration!) I came up with an Extended Events Session to monitor timeouts and a simple SQL Agent Job to query the contents of the session ring buffer and report if there are timeout issues. There are a couple of variations of this out on the web and you can pick your favorite. I decided to monitor rpc_completed events since our applications use stored procedures (you can monitor sql_batch_completed events if you use T-SQL in your calls) and I decided to check the duration of each call for greater than 29 seconds since that’s what I see when timeouts occur in our environment; you can change that if you want as well.
So, without further ado…here’s my code. There are 2 parts. First, here is the code to create and start the Extended Events Session. I run this as a SQL Agent Job so that it runs only during business hours. I use a ring buffer to store the events since I’m only looking at recent activity. I have another SQL Agent Job (not shown here) that stops the session at the end of the business day. Note that I use a WHERE clause in my session to filter events to only those that hit my application databases, and I also exclude several stored procedures that are used by the .NET providers.
CREATE EVENT SESSION [Track Timeouts - Ring Buffer] ON SERVER ADD EVENT sqlserver.rpc_completed (SET collect_statement=1 ACTION ( sqlserver.SQL_text, sqlserver.database_id, sqlserver.client_pid, sqlserver.client_app_name, sqlserver.server_principal_name, sqlserver.session_id ) WHERE (database_name = 'myFirstApplicationDatabase' OR database_name = 'mySecondApplicationDatabase') AND (object_name <> 'sp_reset_connection' AND object_name <> 'sp_prepexec' AND object_name <> 'sp_unprepare') ) ADD TARGET package0.ring_buffer WITH (STARTUP_STATE=OFF) GO /* Start session */ ALTER EVENT SESSION [Track Timeouts - Ring Buffer] ON SERVER STATE=START;
The second part is the code that monitors the session. I have this in a SQL Agent Job that runs every 15 minutes and checks the ring buffer. It reads the buffer and actually reports the stored procedure names that are timing out so that you can free the cache for those procs.
set QUOTED_IDENTIFIER ON; -- -- grab the ring buffer for processing, converting to XML so I can read it! -- SELECT CAST(target_data as xml) AS targetdata INTO #CaptureData FROM sys.dm_xe_session_targets xet JOIN sys.dm_xe_sessions xes ON xes.address = xet.event_session_address WHERE xes.name = 'Track Timeouts - Ring Buffer' AND xet.target_name = 'ring_buffer'; -- -- determine if there are more than 10 timeouts in the buffer and report if necessary -- duration is in microseconds -- if 10 < (SELECT count(xed.event_data.value('(data[@name="duration"]/value)', 'int')) as CountMe FROM #CaptureData CROSS APPLY targetdata.nodes('//RingBufferTarget/event') AS xed (event_data) WHERE xed.event_data.value('(data[@name="duration"]/value)', 'int') > 29000000 ) begin DECLARE @Result varchar(4000) = 'Check the following procs - may need to free proc cache:' + char(10) + char(13), @ProcList varchar(4000); SELECT @procList = COALESCE(@procList + ', ', '') + TimeoutProcName FROM (SELECT xed.event_data.value('(data[@name="object_name"]/value)', 'varchar(255)') as TimeoutProcName FROM #CaptureData CROSS APPLY targetdata.nodes('//RingBufferTarget/event') AS xed (event_data) WHERE xed.event_data.value('(data[@name="duration"]/value)', 'int') > 29000000 ) q; SET @Result = @Result + @ProcList; -- SELECT @Result; -- my wrapper proc that calls sp_send_DBMail...you can use your own Exec MyDBADB.dbo.uspSendSQLAlert @Subject = 'PROD Timeout Tracking' ,@Priority = 'High' ,@Message = @Result ,@Process_Name = 'Timeout Alert' ,@Query = null; end; else begin print 'nothing to report'; end;
Here are a few links to the “inspiration” articles that I mentioned above!