
SQS FIFO High Throughput Queue 500 error


Hello,

We are frequently encountering SQS 500 InternalFailure errors across multiple queues, including low-traffic queues and during off-peak hours. The application logs are included below for reference.

Queue configuration

  • Default visibility timeout: 5 minutes
  • Message retention period: 16 minutes
  • Delivery delay: 1 second
  • Receive message wait time: 20 seconds
  • High throughput FIFO: Enabled
  • Content-based deduplication: Enabled
  • Maximum message size: 256 KiB

App Logs

2026-03-10 07:27:41.943889309 +0000 UTC m=+402487.028984961 Error: deleting message InternalFailure:
2026-03-11 06:11:19.878762261 +0000 UTC m=+484304.963857883 Error: receiving messages: InternalFailure:
2026-03-11 06:11:19.942514973 +0000 UTC m=+484305.027610595 Error: deleting message InternalFailure:
2026-03-11 06:11:20.319294535 +0000 UTC m=+484305.404390158 Error: deleting message InternalFailure:
2026-03-11 06:11:20.3261207 +0000 UTC m=+484305.411216312 Error: deleting message InternalFailure:
2026-03-11 06:11:20.347928398 +0000 UTC m=+484305.433024090 Error: deleting message InternalFailure:
2026-03-11 06:40:47.521356251 +0000 UTC m=+486072.606451872 Error: deleting message InternalFailure:
2026-03-11 06:40:47.593268184 +0000 UTC m=+486072.678363796 Error: receiving messages: InternalFailure:
Error sending message to voucher replicate SQS: InternalFailure:
2026-03-11 06:56:14.178716816 +0000 UTC m=+486999.263812438 Error: receiving messages: InternalFailure:
2026-03-11 06:56:40.827525755 +0000 UTC m=+487025.912621377 Error: deleting message InternalFailure:
2026-03-11 06:56:40.864196032 +0000 UTC m=+487025.949291654 Error: deleting message InternalFailure:
2026-03-11 06:56:40.953278475 +0000 UTC m=+487026.038374097 Error: deleting message InternalFailure:
2026-03-11 06:56:41.304870813 +0000 UTC m=+487026.389966425 Error: deleting message InternalFailure:
2026-03-11 06:56:41.335472242 +0000 UTC m=+487026.420567864 Error: deleting message InternalFailure:
2026-03-11 06:56:41.697253769 +0000 UTC m=+487026.782349391 Error: deleting message InternalFailure:
2026-03-11 07:13:53.616698588 +0000 UTC m=+488058.701794230 Error: receiving messages: InternalFailure:
2026-03-11 10:11:37.55142363 +0000 UTC m=+498722.636519242 Error: receiving messages: InternalFailure:
2026-03-11 10:11:38.014544962 +0000 UTC m=+498723.099640584 Error: deleting message InternalFailure:
2026-03-11 10:11:39.062919602 +0000 UTC m=+498724.148015244 Error: deleting message InternalFailure:
2026-03-11 10:11:42.838998589 +0000 UTC m=+498727.924094201 Error: receiving messages: InternalFailure:
2026-03-11 10:12:56.180491895 +0000 UTC m=+498801.265587547 Error: deleting message InternalFailure:
2026-03-11 10:12:56.216866365 +0000 UTC m=+498801.301962017 Error: receiving messages: InternalFailure:
2026-03-11 10:12:56.244862041 +0000 UTC m=+498801.329957653 Error: deleting message InternalFailure:
2026-03-11 10:12:56.276484388 +0000 UTC m=+498801.361580010 Error: deleting message InternalFailure:
Error sending message to Stock Voucher Summary SQS: InternalFailure:
2026-03-11 10:12:56.321247482 +0000 UTC m=+498801.406343094 Error: deleting message InternalFailure:
2026-03-11 11:03:12.658919999 +0000 UTC m=+501817.744015611 Error: deleting message InternalFailure:
2026-03-11 11:03:12.705062009 +0000 UTC m=+501817.790157661 Error: receiving messages: InternalFailure:
2026-03-11 11:03:12.73325932 +0000 UTC m=+501817.818354942 Error: deleting message InternalFailure:
2026-03-11 11:03:12.785975069 +0000 UTC m=+501817.871070721 Error: deleting message InternalFailure:
2026-03-11 11:03:13.002548906 +0000 UTC m=+501818.087644558 Error: deleting message InternalFailure:
2026-03-11 11:03:13.194992321 +0000 UTC m=+501818.280087933 Error: deleting message InternalFailure:
2026-03-11 13:57:13.446625635 +0000 UTC m=+32.764342518 Error: receiving messages: InternalFailure:
2026-03-24 04:47:52.151006075 +0000 UTC m=+312412.661497427 Error: deleting message InternalFailure:
2026-03-24 04:48:09.150456946 +0000 UTC m=+312429.660948287 Error: receiving messages: InternalFailure:
2026-03-24 04:48:27.573323024 +0000 UTC m=+312448.083814345 Error: receiving messages: InternalFailure:
2026-03-24 04:56:00.281318174 +0000 UTC m=+312900.791809506 Error: deleting message InternalFailure:
2026-03-24 04:56:00.299659531 +0000 UTC m=+312900.810150853 Error: deleting message InternalFailure:
2026-03-24 04:56:00.550066281 +0000 UTC m=+312901.060557603 Error: deleting message InternalFailure:
2026-03-24 04:56:00.692105974 +0000 UTC m=+312901.202597295 Error: deleting message InternalFailure:
2026-03-24 04:56:00.846375163 +0000 UTC m=+312901.356866495 Error: deleting message InternalFailure:
2026-03-24 04:56:00.886222338 +0000 UTC m=+312901.396713660 Error: deleting message InternalFailure:
2026-03-24 04:56:00.913753694 +0000 UTC m=+312901.424245036 Error: deleting message InternalFailure:
2026-03-24 05:02:47.813644757 +0000 UTC m=+313308.324136099 Error: receiving messages: InternalFailure:
Error sending message to voucher replicate SQS: InternalFailure:
2026-03-24 06:05:45.180859452 +0000 UTC m=+317085.691350774 Error: receiving messages: InternalFailure:
2026-03-24 06:05:45.535995538 +0000 UTC m=+317086.046486860 Error: deleting message InternalFailure:
2026-03-24 06:05:45.547217308 +0000 UTC m=+317086.057708650 Error: deleting message InternalFailure:
2026-03-24 06:05:45.580599892 +0000 UTC m=+317086.091091214 Error: deleting message InternalFailure:
2026-03-24 06:05:45.601231496 +0000 UTC m=+317086.111722808 Error: deleting message InternalFailure:
2026-03-24 06:05:45.630170223 +0000 UTC m=+317086.140661535 Error: deleting message InternalFailure:
2026-03-24 06:05:45.722186475 +0000 UTC m=+317086.232677797 Error: deleting message InternalFailure:
2026-03-24 06:05:45.762614017 +0000 UTC m=+317086.273105339 Error: deleting message InternalFailure:
2026-03-24 06:05:45.972659732 +0000 UTC m=+317086.483151054 Error: deleting message InternalFailure:
Error sending message to voucher replicate SQS: InternalFailure:
2026-03-24 06:05:46.004580657 +0000 UTC m=+317086.515071979 Error: deleting message InternalFailure:
2026-03-24 06:05:46.100860553 +0000 UTC m=+317086.611351875 Error: deleting message InternalFailure:
2026-03-24 06:05:46.131640766 +0000 UTC m=+317086.642132088 Error: deleting message InternalFailure:
2026-03-24 06:05:46.131666756 +0000 UTC m=+317086.642158078 Error: deleting message InternalFailure:
2026-03-24 06:05:46.15177709 +0000 UTC m=+317086.662268442 Error: deleting message InternalFailure:
2026-03-24 06:05:46.234525551 +0000 UTC m=+317086.745016893 Error: deleting message InternalFailure:
2026-03-24 06:05:46.427691385 +0000 UTC m=+317086.938182757 Error: deleting message InternalFailure:
2026-03-24 06:05:46.579707471 +0000 UTC m=+317087.090198823 Error: deleting message InternalFailure:
2026-03-24 06:05:46.780581647 +0000 UTC m=+317087.291073009 Error: deleting message InternalFailure:
2026-03-24 06:05:47.104506112 +0000 UTC m=+317087.614997424 Error: deleting message InternalFailure:
2026-03-24 06:18:18.797102159 +0000 UTC m=+317839.307593481 Error: receiving messages: InternalFailure:
2026-03-24 06:18:19.311996375 +0000 UTC m=+317839.822487727 Error: deleting message InternalFailure:
2026-03-24 06:18:19.638922441 +0000 UTC m=+317840.149413763 Error: deleting message InternalFailure:
2026-03-24 06:18:19.754606093 +0000 UTC m=+317840.265097485 Error: deleting message InternalFailure:
2026-03-24 06:18:19.808002973 +0000 UTC m=+317840.318494325 Error: deleting message InternalFailure:
2026-03-24 06:18:19.837750072 +0000 UTC m=+317840.348241394 Error: deleting message InternalFailure:
2026-03-24 06:18:20.805314381 +0000 UTC m=+317841.315805703 Error: deleting message InternalFailure:
2026-03-24 06:18:22.839087645 +0000 UTC m=+317843.349578988 Error: deleting message InternalFailure:
2026-03-24 06:18:24.05988678 +0000 UTC m=+317844.570378132 Error: receiving messages: InternalFailure:
2026-03-24 06:21:29.098225772 +0000 UTC m=+318029.608717084 Error: deleting message InternalFailure:
2026-03-24 06:21:29.217983453 +0000 UTC m=+318029.728474765 Error: deleting message InternalFailure:
2026-03-24 08:06:43.241458531 +0000 UTC m=+324343.751949853 Error: deleting message InternalFailure:
2026-03-24 08:06:43.280634812 +0000 UTC m=+324343.791126164 Error: deleting message InternalFailure:
2026-03-24 08:06:43.317342405 +0000 UTC m=+324343.827833727 Error: deleting message InternalFailure:
2026-03-24 08:06:43.421738924 +0000 UTC m=+324343.932230266 Error: deleting message InternalFailure:
2026-03-24 08:06:44.268353875 +0000 UTC m=+324344.778845216 Error: deleting message InternalFailure:
2026-03-24 08:06:44.316766045 +0000 UTC m=+324344.827257367 Error: deleting message InternalFailure:
2026-03-24 08:06:44.344584386 +0000 UTC m=+324344.855075698 Error: deleting message InternalFailure:
2026-03-25 04:09:18.976241257 +0000 UTC m=+51432.128421412 Error: receiving messages: InternalFailure:
2026-03-25 08:03:39.593440378 +0000 UTC m=+65492.745620533 Error: deleting message InternalFailure:
2026-03-25 10:07:04.824873278 +0000 UTC m=+72897.977053433 Error: receiving messages: InternalFailure:
2026-03-25 10:07:06.160225491 +0000 UTC m=+72899.312405616 Error: receiving messages: InternalFailure:
2026-03-25 10:07:07.478064486 +0000 UTC m=+72900.630244611 Error: receiving messages: InternalFailure:
2026-03-25 14:54:07.097683346 +0000 UTC m=+90120.249863471 Error: receiving messages: InternalFailure:
2026-03-25 17:51:05.077767934 +0000 UTC m=+100738.229948089 Error: deleting message InternalFailure:
2026-03-25 17:51:05.164390008 +0000 UTC m=+100738.316570133 Error: deleting message InternalFailure:
2026-03-25 17:51:05.6355201 +0000 UTC m=+100738.787700255 Error: deleting message InternalFailure:
2026-03-25 17:51:05.7939505 +0000 UTC m=+100738.946130645 Error: deleting message InternalFailure:
2026-03-25 17:51:06.133788673 +0000 UTC m=+100739.285968799 Error: deleting message InternalFailure:
2026-03-25 17:51:06.89847673 +0000 UTC m=+100740.050656865 Error: deleting message InternalFailure:

3 Answers

Hello.

Is this error a recent issue?
Or has it been occurring for a long time?
Is your "Message throughput" exceeding the limit described in the following document?
For example, it's possible that multiple clients are accessing a single SQS queue, causing the limit to be exceeded.
You might be able to reduce 500 errors by using batch processing to streamline message processing or by using long polling to reduce empty responses, so please check those options.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html

Standard SQS queues are designed for "at-least-once delivery," and even on a FIFO queue a message is delivered again if its delete call fails or its visibility timeout expires. It's therefore recommended to make your processing idempotent so that handling the same message more than once doesn't cause problems.
Furthermore, retrying failed calls may help you ride out temporary 500 errors.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues-at-least-once-delivery.html

EXPERT
answered a month ago
EXPERT
reviewed a month ago
  • "Why might this be happening even during low traffic / off-peak hours?"

    It's puzzling that this is happening even during off-peak hours. I think it will be difficult to resolve without investigating the internal workings of AWS. If the issue lies with AWS, you will need to contact AWS Support.

    "Are there any known limits or internal throttling scenarios for SQS?"

    Regarding throttling, I believe the only publicly available information is the quota limits described in the following document. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html

    "Recommended best practices or architecture changes to handle this more reliably?"

    As I mentioned in my answer, you may be able to reduce 500 errors by batching requests so that fewer API calls are made, and by using long polling to reduce empty responses. Retrying failed calls may also help you ride out temporary 500 errors; retries are an effective measure against errors caused by service throttling and other transient issues. The retry pattern is explained in the following document. https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/retry-backoff.html
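
A minimal sketch of the retry-with-backoff pattern described above, using "full jitter" (sleep a random time below an exponentially growing, capped ceiling). `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names introduced for illustration, and the base delay, cap, and attempt count are assumed values, not settings taken from this thread. AWS SDKs generally have retry behavior of their own, so check how your client is configured before layering this on top.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for an SDK error such as HTTP 500 / InternalFailure."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter sleep (seconds) before each retry, not before the first try."""
    for attempt in range(max_attempts - 1):
        # Exponential ceiling, capped, then a uniform random point below it.
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Run operation(); on TransientError, back off and retry up to max_attempts in total."""
    delays = backoff_delays(max_attempts, **backoff_kwargs)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error to the caller
            sleep(delay)
```

Because the failures in the logs arrive in short bursts, the jitter matters: it keeps many consumers from retrying at the same instant once the service recovers.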

  • "Review internal service logs for the provided Request IDs. Check for partition split/rebalance events on your queue during the affected timeframes. This requires escalation to the SQS service team for internal investigation."

    As quoted above, did you share the request IDs with AWS Support and have them investigate? From what you've shared, it appears that no detailed investigation was carried out, and the reply reads like an automated AI response. I think you need to go back to AWS Support and ask them to investigate further.


Hi,

Thank you for your reply.

We have been facing this issue for the past three weeks. The message throughput is not scaling as expected, and we are intermittently receiving SQS 500 Internal Failure errors.

This issue occurs even on low-traffic queues (around 1,400–2,000 messages per minute) and during off-peak hours (night time), when application load is minimal.

We expected better throughput under these conditions, but the issue persists regardless of traffic levels.

Could anyone help clarify:

  • Why might this be happening even during low traffic / off-peak hours?
  • Are there any known limits or internal throttling scenarios for SQS?
  • What best practices or architecture changes would you recommend to handle this more reliably?

Thanks in advance for your help.

answered a month ago

We have also contacted AWS Support, but they have not been able to provide the exact root cause for this issue.

AWS Support Reply

I've reviewed your case regarding the SQS FIFO high throughput queue experiencing 500 errors. Let me provide you with a comprehensive analysis and answers to your questions as provided by our service team.

Analysis of Your Issue

Based on your detailed case description, you're experiencing intermittent InternalFailure (HTTP 500) errors on your FIFO queue tallydataconnector.fifo in ap-south-1. The pattern you've identified is highly characteristic of internal AWS service events rather than configuration issues on your side.

Answers to Your Specific Questions

1. Confirmation of Partition Rebalancing

Your hypothesis about partition rebalancing is reasonable. The burst pattern (10-15 failures in 1-2 seconds, then automatic recovery) is consistent with internal partition operations. To get definitive confirmation, AWS needs to:

  • Review internal service logs for the provided Request IDs
  • Check for partition split/rebalance events on your queue during the affected timeframes
  • This requires escalation to the SQS service team for internal investigation
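
To check the burst pattern against your own logs, one option is to group the timestamped error lines by the gap between them. The sketch below assumes the log format shown in the question (a leading `YYYY-MM-DD HH:MM:SS.fraction` timestamp in UTC) and an assumed 2-second gap threshold; `parse_timestamp` and `group_bursts` are hypothetical helper names. Lines without a leading timestamp are skipped.

```python
from datetime import datetime, timedelta


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.fraction' of a log line."""
    date_part, time_part = line.split()[:2]
    clock, _, fraction = time_part.partition(".")
    # datetime stores microseconds only, so keep at most 6 fractional digits.
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    base = datetime.strptime(f"{date_part} {clock}", "%Y-%m-%d %H:%M:%S")
    return base + timedelta(microseconds=micros)


def group_bursts(lines, max_gap_seconds=2.0):
    """Group timestamped lines into bursts of entries at most max_gap_seconds apart."""
    stamps = sorted(parse_timestamp(l) for l in lines if l[:1].isdigit())
    bursts = []
    for ts in stamps:
        if bursts and (ts - bursts[-1][-1]).total_seconds() <= max_gap_seconds:
            bursts[-1].append(ts)  # close enough to the previous entry: same burst
        else:
            bursts.append([ts])  # gap too large: start a new burst
    return bursts


def summarize(bursts):
    """Return (start, count, duration_seconds) for each burst."""
    return [(b[0], len(b), (b[-1] - b[0]).total_seconds()) for b in bursts]
```

The start time, size, and duration of each burst, together with the matching request IDs, are concrete data points to hand to the service team.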

2. Expected Duration of Partition Rebalance

Your observed 1-2 second burst duration aligns with typical partition operation timeframes. However, the exact SLA and expected duration can only be confirmed by the SQS service team through internal documentation.

3. Configuration to Reduce Rebalancing Frequency

Unfortunately, there are no customer-facing controls for:

  • Partition management
  • Reserved concurrency for SQS
  • Pre-warming mechanisms
  • Controlling rebalance triggers

Partition management is entirely handled by the SQS service internally and is transparent to customers.

4. Using DeleteMessageBatch

Yes, this is strongly recommended. Using DeleteMessageBatch (up to 10 messages per call) will:

  • Reduce the total number of API calls
  • Lower per-partition TPS pressure
  • Potentially reduce rebalancing frequency
  • Improve overall throughput and cost efficiency

This is a best practice regardless of the 500 error issue.
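
A rough sketch of the batching logic, kept independent of any particular SDK. `delete_batch` is a hypothetical callable that wraps the DeleteMessageBatch call for your queue and returns its response as a dict. The `Failed` list with per-entry `Id` values mirrors the shape of the real API response, but the wrapper itself is an assumption. A batch call can succeed overall while individual entries fail, so the sketch collects the failed receipt handles for a later retry.

```python
def chunked(items, size=10):
    """Split items into lists of at most `size` (SQS batch calls accept up to 10 entries)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def delete_in_batches(receipt_handles, delete_batch):
    """Delete handles in batches of 10; return the handles reported as failed.

    `delete_batch` (hypothetical) takes a list of {"Id", "ReceiptHandle"} entries
    and returns a dict that may contain "Failed": [{"Id": ...}, ...].
    """
    failed = []
    for batch in chunked(list(receipt_handles)):
        # Ids only need to be unique within a single batch request.
        entries = [{"Id": str(i), "ReceiptHandle": h} for i, h in enumerate(batch)]
        response = delete_batch(entries)
        failed_ids = {f["Id"] for f in response.get("Failed", [])}
        failed.extend(e["ReceiptHandle"] for e in entries if e["Id"] in failed_ids)
    return failed
```

Deleting 23 messages this way takes 3 API calls instead of 23; the returned handles can be retried before their visibility timeout expires.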

5. CloudWatch Metrics for Partition Rebalancing

No direct metric exists. AWS does not expose partition-level metrics or rebalancing events to customers. However, you can indirectly monitor through:

  • NumberOfMessagesDeleted (which you're already tracking - drops indicate issues)
  • ApproximateNumberOfMessagesVisible (spikes when deletes fail)
  • Custom CloudWatch alarms based on SDK error counts with status code 500

6. Message Retention Period Sufficiency

Your 16-minute retention period is extremely short and poses a high risk of message loss. Here's why:

  • Visibility timeout: 5 minutes (based on your description)
  • Rebalance event + recovery time: ~1-2 seconds
  • Consumer processing time: variable
  • If a consumer crashes or experiences delays, messages could expire before reprocessing

Recommendation: Increase retention to at least 4 hours (or even the default 4 days) to provide adequate buffer for:

  • Multiple redelivery attempts
  • Consumer recovery time
  • Operational incidents
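
As a back-of-the-envelope check on that buffer, the sketch below counts how many full visibility-timeout windows fit inside a retention period. This is a simplification that assumes each failed attempt holds the message for the whole visibility timeout and ignores processing and polling time; `full_visibility_windows` is a hypothetical helper name.

```python
def full_visibility_windows(retention_seconds, visibility_timeout_seconds):
    """Number of complete visibility-timeout windows that fit in the retention period."""
    return retention_seconds // visibility_timeout_seconds


VISIBILITY_TIMEOUT = 5 * 60  # 5 minutes, from the queue configuration in the question

current = full_visibility_windows(16 * 60, VISIBILITY_TIMEOUT)           # 16-minute retention
four_hours = full_visibility_windows(4 * 60 * 60, VISIBILITY_TIMEOUT)    # recommended minimum
default = full_visibility_windows(4 * 24 * 60 * 60, VISIBILITY_TIMEOUT)  # 4-day default

print(current, four_hours, default)  # prints: 3 48 1152
```

Under these assumptions, the current settings leave room for only about three timed-out attempts before a message expires, compared with 48 at 4 hours of retention.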

Immediate Action Items

  1. Implement Exponential Backoff with Jitter: Ensure your application retries 500 errors with exponential backoff
  2. Switch to DeleteMessageBatch: Refactor to batch delete operations
  3. Increase Message Retention: Change from 16 minutes to minimum 4 hours
  4. Monitor Error Rates: Set up CloudWatch alarms for 500 error spikes
answered a month ago
