Hello.
Is this error a recent issue?
Or has it been occurring for a long time?
Is your "Message throughput" exceeding the limit described in the following document?
For example, it's possible that multiple clients are accessing a single SQS queue, causing the limit to be exceeded.
You might be able to reduce 500 errors by using batch processing to streamline message processing or by using long polling to reduce empty responses, so please check those options.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html
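As a rough illustration of the long polling suggestion above, here is a minimal sketch. It assumes `sqs_client` is a boto3 SQS client (for example `boto3.client("sqs")`); the function name `receive_batch` and its defaults are my own choices, not something from your setup.

```python
def receive_batch(sqs_client, queue_url, wait_seconds=20, max_messages=10):
    """Long-poll the queue once and return the received messages (possibly empty)."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS accepts 1-10 per call
        WaitTimeSeconds=wait_seconds,      # 1-20 seconds enables long polling
    )
    # The "Messages" key is absent when nothing arrived within the wait time.
    return response.get("Messages", [])
```

Fetching up to 10 messages per call and waiting up to 20 seconds should cut both the number of API calls and the number of empty responses.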
Since SQS is designed for "at-least-once delivery," it's recommended to implement idempotency in your application so that processing the same message multiple times doesn't cause problems.
Furthermore, implementing retry functionality in case of such errors may help avoid temporary 500 errors.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues-at-least-once-delivery.html
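To show what idempotent processing could look like, here is a small sketch. It assumes each message is a dict as returned by `receive_message` (so it has a `MessageId`), and it uses an in-memory set purely as a stand-in for whatever durable store you would really use. `process_once` and `seen_ids` are hypothetical names.

```python
def process_once(message, handler, seen_ids):
    """Run handler(message) unless this MessageId was already processed.

    Returns True if the handler ran, False if the message was skipped
    as a duplicate delivery.
    """
    message_id = message["MessageId"]
    if message_id in seen_ids:
        return False
    handler(message)
    # Record the ID only after the handler succeeded, so a failed
    # attempt is retried on redelivery instead of being skipped.
    seen_ids.add(message_id)
    return True
```

If your producers can send the same logical event twice, a business key from the message body may be a better deduplication key than `MessageId`.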
Hi,
Thank you for your reply.
We have been facing this issue for the past three weeks. The message throughput is not scaling as expected, and we are intermittently receiving SQS 500 Internal Failure errors.
This issue occurs even on low-traffic queues (around 1,400–2,000 messages per minute) and during off-peak hours (night time), when application load is minimal.
We expected better throughput under these conditions, but the issue persists regardless of traffic levels.
Could anyone help clarify:
- Why this might be happening even during low traffic / off-peak hours?
- Whether there are any known limits or internal throttling scenarios for SQS?
- Recommended best practices or architecture changes to handle this more reliably?
Thanks in advance for your help.
We have also contacted AWS Support, but they have not been able to provide the exact root cause for this issue.
AWS Support Reply
I've reviewed your case regarding the SQS FIFO high throughput queue experiencing 500 errors. Let me provide you with a comprehensive analysis and answers to your questions as provided by our service team.
Analysis of Your Issue
Based on your detailed case description, you're experiencing intermittent InternalFailure (HTTP 500) errors on your FIFO queue tallydataconnector.fifo in ap-south-1. The pattern you've identified is highly characteristic of internal AWS service events rather than configuration issues on your side.
Answers to Your Specific Questions
1. Confirmation of Partition Rebalancing
Your hypothesis about partition rebalancing is reasonable. The burst pattern (10-15 failures in 1-2 seconds, then automatic recovery) is consistent with internal partition operations. To get definitive confirmation, AWS needs to:
- Review internal service logs for the provided Request IDs
- Check for partition split/rebalance events on your queue during the affected timeframes
- This requires escalation to the SQS service team for internal investigation
2. Expected Duration of Partition Rebalance
Your observed 1-2 second burst duration aligns with typical partition operation timeframes. However, the exact SLA and expected duration can only be confirmed by the SQS service team through internal documentation.
3. Configuration to Reduce Rebalancing Frequency
Unfortunately, there are no customer-facing controls for:
- Partition management
- Reserved concurrency for SQS
- Pre-warming mechanisms
- Controlling rebalance triggers
Partition management is entirely handled by the SQS service internally and is transparent to customers.
4. Using DeleteMessageBatch
Yes, this is strongly recommended. Using DeleteMessageBatch (up to 10 messages per call) will:
- Reduce the total number of API calls
- Lower per-partition TPS pressure
- Potentially reduce rebalancing frequency
- Improve overall throughput and cost efficiency
This is a best practice regardless of the 500 error issue.
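A minimal sketch of the batching described above, assuming `sqs_client` is a boto3 SQS client and `messages` is a list of message dicts from `receive_message`. The helper name `delete_in_batches` is hypothetical.

```python
def delete_in_batches(sqs_client, queue_url, messages, batch_size=10):
    """Delete messages in groups of up to 10.

    Returns the receipt handles of entries that SQS reported as failed.
    """
    failed = []
    for start in range(0, len(messages), batch_size):
        chunk = messages[start:start + batch_size]
        # Entry IDs only need to be unique within a single batch request.
        entries = [
            {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
            for i, m in enumerate(chunk)
        ]
        response = sqs_client.delete_message_batch(
            QueueUrl=queue_url, Entries=entries
        )
        # The call can succeed overall while individual entries fail.
        for failure in response.get("Failed", []):
            failed.append(chunk[int(failure["Id"])]["ReceiptHandle"])
    return failed
```

Checking the `Failed` list matters here: a batch call that returns HTTP 200 can still contain per-entry failures, and those messages will be redelivered unless the delete is retried.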
5. CloudWatch Metrics for Partition Rebalancing
No direct metric exists. AWS does not expose partition-level metrics or rebalancing events to customers. However, you can indirectly monitor through:
- NumberOfMessagesDeleted (which you're already tracking; drops indicate issues)
- ApproximateNumberOfMessagesVisible (spikes when deletes fail)
- Custom CloudWatch alarms based on SDK error counts with status code 500
6. Message Retention Period Sufficiency
Your 16-minute retention period is extremely short and poses a high risk of message loss. Here's why:
- Visibility timeout: 5 minutes (based on your description)
- Rebalance event + recovery time: ~1-2 seconds
- Consumer processing time: variable
- If a consumer crashes or experiences delays, messages could expire before reprocessing
Recommendation: Increase retention to at least 4 hours (or even the default 4 days) to provide adequate buffer for:
- Multiple redelivery attempts
- Consumer recovery time
- Operational incidents
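To put numbers on the retention point, here is a rough sketch. It assumes every failed attempt holds the message for the full visibility timeout, and that `sqs_client` is a boto3 SQS client. The helper names are hypothetical.

```python
MIN_RETENTION_SECONDS = 60          # SQS minimum retention (1 minute)
MAX_RETENTION_SECONDS = 1_209_600   # SQS maximum retention (14 days)


def full_timeouts_within_retention(retention_seconds, visibility_timeout_seconds):
    """How many complete visibility timeouts fit inside the retention period."""
    return retention_seconds // visibility_timeout_seconds


def set_retention(sqs_client, queue_url, retention_seconds):
    """Set MessageRetentionPeriod on the queue after a basic range check."""
    if not MIN_RETENTION_SECONDS <= retention_seconds <= MAX_RETENTION_SECONDS:
        raise ValueError("retention must be between 60 and 1209600 seconds")
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        # Attribute values are passed as strings.
        Attributes={"MessageRetentionPeriod": str(retention_seconds)},
    )
```

With a 16-minute retention and a 5-minute visibility timeout, `full_timeouts_within_retention(16 * 60, 5 * 60)` gives 3, so a message survives only about three stalled attempts before it expires. At 4 hours the same calculation gives 48.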
Immediate Action Items
- Implement Exponential Backoff with Jitter: Ensure your application retries 500 errors with exponential backoff
- Switch to DeleteMessageBatch: Refactor to batch delete operations
- Increase Message Retention: Change from 16 minutes to minimum 4 hours
- Monitor Error Rates: Set up CloudWatch alarms for 500 error spikes
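The backoff item above could be sketched roughly as follows. This is an illustration only: `call_with_backoff` and `is_server_error` are hypothetical helpers, the delays are arbitrary defaults, and the error check assumes the exception carries a botocore-style `response` dict with `ResponseMetadata.HTTPStatusCode`. Note that boto3 also has its own configurable retry behaviour, so check what your SDK already retries before layering this on top.

```python
import random
import time


def is_server_error(exc):
    """True if the exception looks like an HTTP 5xx from the service."""
    response = getattr(exc, "response", None) or {}
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode", 0)
    return status >= 500


def call_with_backoff(operation, is_retryable, max_attempts=5,
                      base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying retryable errors with full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on non-retryable errors.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            # Exponential cap, then a uniformly random delay below it.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

Usage would look like `call_with_backoff(lambda: sqs_client.delete_message_batch(QueueUrl=url, Entries=entries), is_server_error)`. The jitter spreads retries from many consumers so they do not all hit the queue again at the same instant after a 1-2 second burst of failures.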

It's puzzling that this is happening even during off-peak hours... I think it will be difficult to resolve the issue without investigating the internal workings of AWS. If the issue lies with AWS, you will need to contact AWS support.
Regarding throttling, I believe the only publicly available information is the quota limits described in the following document. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html
As I mentioned in my answer, you may be able to reduce 500 errors by batching requests to streamline message processing or by using long polling to reduce empty responses. Implementing retries may also help you ride out temporary 500 errors; retrying is an effective measure for errors caused by service throttling and other transient issues. Retries with backoff are explained in the following document. https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/retry-backoff.html
Also, did you share the request IDs with AWS Support and ask them to investigate? Based on what you've shared, it appears that no detailed investigation was carried out, and the reply reads like an automated, AI-generated response. I think you need to contact AWS Support again and ask them to investigate further.