SQS message age bump

0

Hi, we are observing that the metric "Approximate age of oldest message" suddenly jumps by a significant amount. We have sev2 alarms on this metric and have relaxed them in the past to avoid false alarms. But we still observe these bumps, even though the alarm duration is set to 25 minutes to avoid transient spikes.

Max receives = 5
Visibility Timeout = 15 minutes

Metric: https://tiny.amazon.com/eibqbl0s/snapshot

asked 24 days ago · 72 views
2 Answers
-1

It basically means the message is still in the queue, which triggers the alarm. With the visibility timeout set to 15 minutes, I suspect your application is failing to process the message and therefore never deletes it, so it stays in the queue. If you have a DLQ configured, the failed message is sent to the DLQ after roughly (max receives × visibility timeout) = 5 × 15 = 75 minutes.
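As a rough check of that 75-minute figure, here is a minimal Python sketch. It assumes every receive fails, the message is picked up again as soon as it becomes visible, and each failed receive hides it for the full visibility timeout. The function name is made up for illustration.

```python
def minutes_until_dlq(max_receives: int, visibility_timeout_min: float) -> float:
    """Approximate minutes a never-deleted message stays in the source queue.

    Assumption: each failed receive hides the message for the full
    visibility timeout, and consumers pick it up again immediately.
    """
    return max_receives * visibility_timeout_min


print(minutes_until_dlq(5, 15))  # 75
```

In practice the wait can be longer than this estimate, since time spent waiting for the first receive and any gaps between polls also count toward the message's age.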

answered 24 days ago
  • Yes, we have a DLQ set, and the message should enter the DLQ after 5 reads (max receives). How can a message age of 7 hours be explained?

-1

The sudden jumps you're observing in the "Approximate age of oldest message" metric for your SQS queue can be attributed to several factors related to how SQS handles message processing and reporting:

  1. Message reordering: For standard queues, when a message is received three or more times without being deleted, SQS moves it to the back of the queue. This reordering can cause the metric to suddenly reflect the age of the next oldest message that hasn't exceeded the receive threshold.

  2. Poison-pill messages: Messages that are repeatedly received but never successfully processed (often called "poison-pill" messages) are excluded from this metric until they are successfully processed. If such a message is suddenly processed or moved, it can cause a jump in the reported age.

  3. Dead-letter queue (DLQ) interaction: With your max receives set to 5, messages that fail processing 5 times will be moved to a DLQ if one is configured. When this happens, the age of the oldest message in the main queue can suddenly change to reflect the next oldest message.

  4. Visibility timeout: Your 15-minute visibility timeout means that messages become visible again if not processed within that time. This can lead to messages becoming "old" if they're repeatedly made visible but not processed.
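To make points 1 and 2 concrete, here is a toy Python model of the reported age. It is only a sketch of the behavior described above, not how SQS computes the metric internally. The `Msg` class, the function name, and the default threshold of 3 receives are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Msg:
    sent_at: float       # seconds since epoch when the message was sent
    receive_count: int   # how many times it has been received so far


def reported_oldest_age(msgs: List[Msg], now: float, threshold: int = 3) -> Optional[float]:
    """Age in seconds of the oldest message the toy model still counts.

    Messages received `threshold` or more times are skipped (assumed
    behavior, per points 1 and 2). Returns None if nothing is counted.
    """
    counted = [m for m in msgs if m.receive_count < threshold]
    if not counted:
        return None
    return now - min(m.sent_at for m in counted)


now = 10_000.0
queue = [Msg(sent_at=1_000.0, receive_count=2),   # old message, failing repeatedly
         Msg(sent_at=9_400.0, receive_count=0)]   # fresh message
print(reported_oldest_age(queue, now))  # 9000.0

# One more failed receive pushes the old message over the threshold,
# so the model switches to the next oldest message in a single step.
queue[0].receive_count = 3
print(reported_oldest_age(queue, now))  # 600.0
```

The point of the sketch is that the reported value is discontinuous: a single receive can change which message the metric tracks, so the graph moves in steps rather than growing smoothly.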

To address these issues and reduce false alarms:

  1. Review your message processing logic to ensure messages are being handled efficiently and deleted when appropriate.

  2. Consider implementing a dead-letter queue if you haven't already, to isolate problematic messages.

  3. Adjust your visibility timeout if needed to give your consumers enough time to process messages.

  4. Monitor additional metrics like "ApproximateNumberOfMessagesVisible" and "NumberOfMessagesReceived" to get a more comprehensive view of queue behavior.

  5. Consider using a longer evaluation period for your CloudWatch alarm to smooth out short-term spikes, though be cautious not to mask real issues.
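As a sketch of suggestion 5, the alarm settings could be built like this in Python and passed to boto3's `put_metric_alarm`. The queue name, alarm name, and the 3600-second threshold are placeholders, not values from the question. The 5 × 300-second window is chosen only to match the 25-minute duration mentioned there.

```python
def age_alarm_params(queue_name: str, threshold_seconds: int,
                     period_seconds: int = 300, evaluation_periods: int = 5) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": evaluation_periods,  # every datapoint must breach
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = age_alarm_params("my-queue", threshold_seconds=3600)  # placeholder values
print(params["Period"] * params["EvaluationPeriods"] / 60)  # 25.0 (minutes of sustained breach)

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring all datapoints in the window to breach filters out short spikes, at the cost of delaying the alarm by the full window when the problem is real.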

Remember, the "ApproximateAgeOfOldestMessage" metric is designed to help you monitor if your consumers are processing SQS messages at the desired speed. These jumps might indicate underlying issues with message processing that need to be addressed.
Sources
Recommended alarms - Amazon CloudWatch
Available CloudWatch metrics for Amazon SQS - Amazon Simple Queue Service

answered 24 days ago
