Based on AWS best practices for monitoring SQS dead-letter queues in production:
1. CloudWatch Metrics and Alarm Configuration
You're right to switch from NumberOfMessagesSent to ApproximateNumberOfMessagesVisible. NumberOfMessagesSent doesn't count messages that SQS moves to a DLQ as a result of failed processing attempts, so it is unsuitable for monitoring dead-letter queues. ApproximateNumberOfMessagesVisible counts every message currently available for retrieval in the DLQ, including those moved there by processing failures, which makes it the appropriate metric for alarms.
For alarm thresholds, there isn't a one-size-fits-all answer, but triggering on any visible message (a threshold of 1 or higher) is a common approach, since every message in a DLQ represents a failure that needs investigation. When the alarm fires, you can poll the queue to retrieve and review the failed messages. To distinguish slow accumulation from sudden spikes, consider setting up multiple alarms with different thresholds and evaluation periods: one that alerts immediately on any message, and another that alerts on sustained accumulation over time.
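The two-alarm setup described above could be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive configuration: the alarm name suffixes, the periods, the evaluation counts, and the SNS topic ARN are illustrative assumptions that you should tune to your own workload.

```python
from typing import Any


def dlq_alarm_params(
    queue_name: str,
    alarm_suffix: str,
    threshold: int,
    period_seconds: int,
    evaluation_periods: int,
    sns_topic_arn: str,
) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm on a DLQ."""
    return {
        "AlarmName": f"{queue_name}-{alarm_suffix}",  # naming scheme is an assumption
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        # Require every evaluated period to breach before alarming.
        "DatapointsToAlarm": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # An empty, idle DLQ may report no data; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_dlq_alarms(queue_name: str, sns_topic_arn: str) -> None:
    """Create one immediate alarm and one sustained-backlog alarm."""
    import boto3  # imported lazily so the builder above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    # Immediate: any visible message in a single 1-minute period.
    cloudwatch.put_metric_alarm(
        **dlq_alarm_params(queue_name, "any-message", 1, 60, 1, sns_topic_arn)
    )
    # Sustained: messages still visible across twelve 5-minute periods (1 hour).
    cloudwatch.put_metric_alarm(
        **dlq_alarm_params(queue_name, "backlog-1h", 1, 300, 12, sns_topic_arn)
    )
```

Calling `create_dlq_alarms("orders-dlq", topic_arn)` with a hypothetical queue name and topic ARN would create both alarms. Using `Maximum` rather than `Sum` avoids double-counting the same backlog across datapoints.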
2. DLQ Retention Period
Your current 14-day retention matching the source queue goes against recommended practice. It's a best practice to always set the retention period of a dead-letter queue to be longer than the retention period of the original queue. Here's why: for standard queues, when a message is moved to a dead-letter queue, the enqueue timestamp is unchanged and expiration is based on the original enqueue timestamp. If a message spends time in the original queue before moving to the DLQ, it will have less time remaining in the DLQ before deletion. For example, if a message spends 1 day in the original queue before moving to a DLQ with a 14-day retention, it remains in the DLQ for only 13 days, not 14. Because 14 days is the maximum retention period SQS allows, the practical fix in your case is to shorten the source queue's retention rather than lengthen the DLQ's. A DLQ retention period longer than the source queue's gives you adequate time to investigate and process failed messages before they are deleted.
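The arithmetic above can be checked with a small helper. This is only an illustrative sketch for standard queues; the function names are hypothetical, and the second helper simply formats a value for the `MessageRetentionPeriod` queue attribute, which SQS expects in seconds.

```python
from datetime import timedelta

# SQS caps message retention at 14 days.
SQS_MAX_RETENTION = timedelta(days=14)


def remaining_time_in_dlq(
    dlq_retention: timedelta, age_when_moved: timedelta
) -> timedelta:
    """Time a message survives in a standard DLQ after being moved.

    Expiration is based on the original enqueue timestamp, so time already
    spent in the source queue is subtracted from the DLQ retention.
    """
    remaining = dlq_retention - age_when_moved
    return max(remaining, timedelta(0))


def retention_attribute(retention: timedelta) -> dict[str, str]:
    """Format a retention period for SQS set_queue_attributes."""
    if retention > SQS_MAX_RETENTION:
        raise ValueError("SQS retention cannot exceed 14 days")
    return {"MessageRetentionPeriod": str(int(retention.total_seconds()))}
```

For the example in the text, `remaining_time_in_dlq(timedelta(days=14), timedelta(days=1))` gives 13 days. With a 4-day source queue and a 14-day DLQ, even a message moved at the very end of its source-queue lifetime would still have 10 days left.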
3. Message Replay and Automation
While manual replay through the console works for small volumes, scaling this process requires automation. The redrive policy on the source queue is what moves a message to the dead-letter queue after the source queue fails to process it a specified number of times. To move messages back out of the dead-letter queue, you can set the DLQ as an event source for a Lambda function or write custom consumer logic using the Amazon SQS API, an AWS SDK, or the AWS CLI. Building automation around the redrive process, whether through Lambda functions, scheduled jobs, or custom tooling, is essential for production environments handling significant message volumes.
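As one possible shape for the custom consumer logic mentioned above, the following sketch moves messages from a DLQ back to a target queue using the standard `receive_message`, `send_message`, and `delete_message` calls. The function names are hypothetical, the client is passed in (for example `boto3.client("sqs")`), and the sketch assumes standard queues: FIFO queues would also need a `MessageGroupId`, which is not handled here.

```python
from typing import Any


def redrive_batch(sqs: Any, dlq_url: str, target_url: str, max_messages: int = 10) -> int:
    """Move up to max_messages (SQS allows at most 10) from the DLQ to the target."""
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=1,  # long polling reduces false "empty" responses
        MessageAttributeNames=["All"],
    )
    moved = 0
    for message in response.get("Messages", []):
        send_kwargs = {"QueueUrl": target_url, "MessageBody": message["Body"]}
        # Forward custom attributes when present (assumes they round-trip as received).
        if message.get("MessageAttributes"):
            send_kwargs["MessageAttributes"] = message["MessageAttributes"]
        sqs.send_message(**send_kwargs)
        # Delete only after the send succeeded, so a failure leaves the message in the DLQ.
        sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"])
        moved += 1
    return moved


def redrive_all(sqs: Any, dlq_url: str, target_url: str) -> int:
    """Repeat redrive_batch until a receive returns no messages."""
    total = 0
    while True:
        moved = redrive_batch(sqs, dlq_url, target_url)
        if moved == 0:
            return total
        total += moved
```

Replaying blindly can send poison-pill messages straight back into the DLQ, so in practice you would likely fix the consumer first or filter messages before resending. SQS also offers a built-in dead-letter queue redrive through the `StartMessageMoveTask` API, which may be simpler than a hand-written loop when no filtering is needed.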
Using a dead-letter queue decreases the number of messages in your source queues and reduces your exposure to poison-pill messages (messages that are received but can't be processed), which can distort metrics and cause false alarms. Proper monitoring and automation around DLQs help you quickly identify, investigate, and remediate processing failures at scale.
Sources
Capturing problematic messages in Amazon SQS - Amazon Simple Queue Service
Creating alarms for dead-letter queues using Amazon CloudWatch - Amazon Simple Queue Service
Amazon SNS dead-letter queues - Amazon Simple Notification Service
Setting-up dead-letter queue retention in Amazon SQS - Amazon Simple Queue Service