What are the best practices for monitoring SQS dead letter queues in production?

We're running several SQS queues in production, all with DLQs configured. I set up CloudWatch alarms when we first deployed, but I'm honestly not confident the setup catches what it should.

My initial approach was to alarm on NumberOfMessagesSent for the DLQ, thinking that any message landing there was a signal to act on. Then I noticed that metric only increments when something actually sends to the DLQ, not when messages already sitting there are aging out or accumulating slowly. So a queue that's been leaking a message every hour might not trigger anything until the count spikes. That felt like a gap.

I switched to alarming on ApproximateNumberOfMessagesVisible instead, which seems better, but I'm not sure what threshold makes sense. Should it be 1? Some teams seem to tolerate a small backlog. I'm also not sure how to catch a slow buildup over time versus a sudden flood.

On retention: right now the DLQ has a 14-day retention period, matching the source queue. I've seen some teams set the DLQ retention much longer so they have time to investigate and replay. Is there a standard approach here, or does it depend entirely on message volume?

Finally, replaying messages is a pain. We're doing it manually via the SQS console, which doesn't scale. I know there's a redrive API, but I haven't built automation around it yet.

  1. Which CloudWatch metrics and alarm configurations do you actually use in production for DLQ monitoring, and how do you handle slow accumulation versus spikes?
  2. What retention period do you set on DLQs, and does your reasoning change based on message type or volume?
  3. How does your team handle message replay at scale, and have you built any automation or tooling around the redrive process?
1 Answer
Based on AWS best practices for monitoring SQS dead-letter queues in production:

1. CloudWatch Metrics and Alarm Configuration

You're right to switch from NumberOfMessagesSent to ApproximateNumberOfMessagesVisible. NumberOfMessagesSent only counts messages explicitly sent to the queue; it does not increment when the redrive policy moves a message into the DLQ after failed processing attempts, which makes it unsuitable for DLQ monitoring. ApproximateNumberOfMessagesVisible reflects every message currently available in the DLQ, however it arrived, so it is the appropriate metric to alarm on.

For alarm thresholds, there isn't a one-size-fits-all answer, but triggering on any visible message (threshold of 1) is a common starting point, since a message in a DLQ represents a failure that needs investigation; teams that tolerate a small known backlog raise the threshold accordingly. When the alarm fires, poll the DLQ to review and retrieve the failed messages. To distinguish slow accumulation from sudden spikes, pair two alarms with different thresholds and evaluation periods: one that fires immediately on any message, and one that fires only when a larger backlog persists across several consecutive periods.
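As a concrete sketch of that two-alarm pattern, the dicts below map directly onto boto3's `cloudwatch.put_metric_alarm(**params)`. The queue name, alarm names, and thresholds are placeholders to adapt to your environment:

```python
# Two-alarm DLQ monitoring sketch. DLQ_NAME and the thresholds are
# hypothetical; each dict is a ready-made put_metric_alarm parameter set.

DLQ_NAME = "orders-dlq"  # placeholder queue name

def dlq_alarm(name, threshold, period_seconds, evaluation_periods):
    """Build put_metric_alarm parameters for ApproximateNumberOfMessagesVisible."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": DLQ_NAME}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # An idle queue can stop emitting datapoints; don't treat that as breach.
        "TreatMissingData": "notBreaching",
    }

# Spike alarm: any message visible in the DLQ within one 5-minute period.
spike_alarm = dlq_alarm("orders-dlq-any-messages", threshold=1,
                        period_seconds=300, evaluation_periods=1)

# Slow-buildup alarm: backlog stays at 10+ for a full hour (12 x 5 min).
buildup_alarm = dlq_alarm("orders-dlq-sustained-backlog", threshold=10,
                          period_seconds=300, evaluation_periods=12)

# To create the alarms for real (requires boto3 and AWS credentials):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**spike_alarm)
#   cw.put_metric_alarm(**buildup_alarm)
```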

2. DLQ Retention Period

Your current 14-day retention matching the source queue goes against recommended practice: always set the DLQ's retention period longer than the source queue's. Here's why: for standard queues, the enqueue timestamp is set when the message first enters the original queue and is not reset when the message moves to the DLQ, so expiration is based on that original timestamp. Any time the message spent in the source queue is subtracted from its remaining life in the DLQ. For example, a message that waited 1 day in the original queue before moving to a DLQ with 14-day retention is deleted after only 13 more days, not 14. A longer DLQ retention period ensures you have adequate time to investigate and replay failed messages without premature deletion.
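The expiration arithmetic above can be made concrete with a few lines of stdlib Python (the dates are arbitrary illustrations):

```python
# Illustrates why DLQ retention should exceed source-queue retention:
# for standard queues, the enqueue timestamp is set when the message
# first enters the SOURCE queue and is not reset on the move, so the
# retention clock effectively started before the message reached the DLQ.
from datetime import datetime, timedelta

def time_left_in_dlq(enqueued_at, moved_to_dlq_at, dlq_retention_days):
    """Time remaining before SQS deletes the message from the DLQ."""
    expires_at = enqueued_at + timedelta(days=dlq_retention_days)
    return expires_at - moved_to_dlq_at

enqueued = datetime(2024, 1, 1)               # first entered the source queue
moved = enqueued + timedelta(days=1)          # spent 1 day before reaching the DLQ

print(time_left_in_dlq(enqueued, moved, 14))  # 13 days, not 14
print(time_left_in_dlq(enqueued, moved, 28))  # 27 days to investigate and replay
```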

3. Message Replay and Automation

While manual replay through the console works for small volumes, scaling this process requires automation. A redrive policy is what moves messages to a dead-letter queue after a message has been received the configured maximum number of times without being successfully processed. For moving messages back out of the dead-letter queue, you can set it as an event source for a Lambda function, use the redrive (message move task) API, or write custom consumer logic using the Amazon SQS API, AWS SDK, or AWS CLI. Building automation around the redrive process—whether through Lambda functions, scheduled jobs, or custom tooling—is essential for production environments handling significant message volumes.

Using a dead-letter queue keeps failed messages out of your source queues and limits your exposure to poison-pill messages (messages that are received repeatedly but can never be processed), which would otherwise distort metrics and trigger false alarms. Proper monitoring and automation around DLQs ensures you can quickly identify, investigate, and remediate processing failures at scale.
Sources
Capturing problematic messages in Amazon SQS - Amazon Simple Queue Service
Creating alarms for dead-letter queues using Amazon CloudWatch - Amazon Simple Queue Service
Amazon SNS dead-letter queues - Amazon Simple Notification Service
Setting-up dead-letter queue retention in Amazon SQS - Amazon Simple Queue Service

answered 13 days ago
EXPERT
reviewed 13 days ago
