Finding root cause of "processes blocked" on RDS SQL Server

3

Hi,

We are trying to debug a problem with our RDS SQL Server instance, in which the processes blocked spiked for about 10 minutes, and then went back to normal. During the spike the database became unresponsive and affected our production apps.

The main problem we face is that we cannot locate the root cause of the spike.

What we have done so far:

  1. Checked the logs. See below for the relevant lines.
  2. Manually (by matching timestamps) reviewed the queries that were running at the time. As far as we could tell, the queries are the same as those that run everyday without issue, and none of them seemed to be CPU or memory straining. Ofc, it's possible our manual review missed something.

Main question: Is there any other place we can look for clues as to what caused the spike?

Additional info:

Screenshots from RDS monitoring: aws monitoring tool aws monitoring tool aws monitoring tool aws monitoring tool

Snippet from logs:

2023-06-19 22:27:00.44 Server      A fatal error occurred while reading the input stream from the network. The session will be terminated (input error: 10054, output error: 0).
2023-06-19 23:14:01.76 spid53      I/O is frozen on database model. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2023-06-19 23:14:01.76 spid74      I/O is frozen on database msdb. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2023-06-19 23:14:01.76 spid85      I/O is frozen on database rdsadmin. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2023-06-19 23:14:01.76 spid116     I/O is frozen on database Moocho. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2023-06-19 23:14:01.76 spid86      I/O is frozen on database m@ster. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2023-06-19 23:14:02.16 spid74      I/O was resumed on database msdb. No user action is required.
2023-06-19 23:14:02.16 spid116     I/O was resumed on database Moocho. No user action is required.
2023-06-19 23:14:02.16 spid86      I/O was resumed on database m@ster. No user action is required.
2023-06-19 23:14:02.16 spid85      I/O was resumed on database rdsadmin. No user action is required.
2023-06-19 23:14:02.16 spid53      I/O was resumed on database model. No user action is required.
2023-06-20 00:00:27.97 spid23s     This instance of SQL Server has been using a process ID of 4340 since 4/3/2023 1:21:25 AM (local) 4/3/2023 5:21:25 AM (UTC). This is an informational message only; no user action is required.
cesar
asked 10 months ago496 views
2 Answers
0

Would you mind sharing the RDS SQL server version...as per design backup freezes IO for SQL server as backups r taken from primary server..the messages are inline with that..the fatal error message at network layer is generic too..can you review the below blog to see if that fits your context..

https://aws.amazon.com/blogs/database/monitor-deadlocks-in-amazon-rds-for-sql-server-and-set-notifications-using-amazon-cloudwatch/

AWS
answered 10 months ago
  • hi there sure, SQL Server Web Edition Instance class: db.r5.4xlarge Engine version: 12.00.6293.0.v1

    ty

0

One suggestion is to check how granular your metric collection is set to..is it upto 1sec..the reason i ask is if you set to 5mins then it wont be easy to catch this,as during the lapse period it might have occured. Check whether any IO spikes(IOPs,IO latency) during the same time period and have it corelated to the transactions that were shown...just be aware that setting to 1sec resolution would increase the log inflow,& may have an associated cost..

AWS
answered 10 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions