
Kinesis Data Streams: Persistent Throttling Issues Despite Shard Scaling


Description: We are experiencing consistent throttling with our Kinesis Data Stream (KDS) and are seeking guidance on the root cause and resolution. Our key observations suggest the throttling may not be directly related to data volume.

Current Setup:

  • KDS Configuration: 96 shards
  • Input Traffic: ~19,000 messages/second
  • Average Message Size: ~1KB
  • Consumer Configuration:
    • GetRecords batch size: 1,500 records
    • Current error handling: 1-second wait on throttling
    • Number of consumers: 32

Key Issues:

  1. Consistent "Request Exceeded Threshold" metric at 40-50%, regardless of shard count

    • We tested with different shard counts (24, 72, 96) with no improvement
    • This suggests the throttling might not be related to shard capacity
  2. Unexpected Behavior: Throttling occurs even during periods with no incoming data

    • This observation strongly indicates the throttling might be related to our GetRecords polling frequency rather than data volume

Specific Questions:

  1. Is the throttling likely due to hitting the GetRecords API limit (5 calls/second per shard) rather than data throughput limits?
  2. What are the recommended best practices for GetRecords polling frequency in our scenario? (Note: latency is key for us and we don't want to wait when it isn't required, which is why we only wait 1 second on throttling.)

Additional Context:

  • We've observed this behavior consistently across different shard configurations
  • Current monitoring shows throttling patterns remain similar regardless of incoming data volume

Requested Assistance:

  1. Confirmation of the root cause of throttling
  2. Recommendations for optimal consumer configuration
1 Answer

Based on the information provided, it appears that the persistent throttling issues you're experiencing with your Kinesis Data Stream (KDS) are likely due to hitting the GetRecords API limit rather than data throughput limits. Here's an analysis of the situation and some recommendations:

  1. Root Cause of Throttling: The consistent "Request Exceeded Threshold" metric at 40-50%, regardless of shard count, strongly suggests that you're hitting the GetRecords API limit. Each shard in Kinesis Data Streams supports up to 5 GetRecords calls per second. With 96 shards polled by 32 consumers (roughly 3 shards each), your polling loops are likely exceeding that per-shard call budget rather than any data throughput limit.

The fact that throttling occurs even during periods with no incoming data further supports this conclusion. This behavior indicates that your consumers are continuously polling the stream, even when there's no new data to process.
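A quick back-of-the-envelope check with the figures from your question makes the gap visible (all numbers are taken from your post):

```python
# Rough capacity check using the figures from the question:
# 96 shards, 32 consumers, ~19,000 msg/s at ~1 KB, 1,500-record batches.
shards = 96
consumers = 32
incoming_records_per_sec = 19_000

shards_per_consumer = shards / consumers                        # 3 shards per consumer
records_per_shard_per_sec = incoming_records_per_sec / shards   # ~198 records/s per shard

getrecords_calls_per_shard_limit = 5    # hard API limit per shard per second
batch_limit = 1_500                     # records requested per GetRecords call

# A single 1,500-record GetRecords call per shard per second already covers
# ~198 records/s, so if consumers keep hitting the 5 calls/s ceiling they are
# polling far more often than the data volume requires.
print(shards_per_consumer, records_per_shard_per_sec)
```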

  2. Recommendations for Optimal Consumer Configuration:

a. Adjust GetRecords Polling Frequency: Instead of waiting for 1 second on throttling, implement a more adaptive approach. Use exponential backoff with jitter when encountering throttling errors. Start with a small delay (e.g., 100ms) and increase it exponentially up to a maximum (e.g., 1 second) if throttling persists.
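A minimal sketch of that backoff pattern around a plain boto3 polling loop is below; the shard iterator and limit are placeholders, and the only error retried is the GetRecords throttling exception:

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

BASE_DELAY = 0.1   # initial backoff in seconds
MAX_DELAY = 1.0    # cap, matching your current 1-second wait


def get_records_with_backoff(shard_iterator, limit=1_500):
    """Call GetRecords, retrying throttling errors with capped full-jitter backoff."""
    delay = BASE_DELAY
    while True:
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=limit)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Sleep a random amount up to the current ceiling ("full jitter"),
            # then double the ceiling for the next attempt, capped at MAX_DELAY.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, MAX_DELAY)
```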

b. Optimize Batch Size: Your current batch size of 1,500 records is reasonable. GetRecords accepts a Limit of up to 10,000 records (subject to the 10 MB per-call cap), so there is room to fetch larger batches and poll less often while data is flowing; during quiet periods the bigger saving comes from lengthening the interval between calls rather than from shrinking the batch.

c. Implement Adaptive Reads: Have your consumers adjust the interval between GetRecords calls based on whether the previous call returned data and how far behind the shard tip they are (MillisBehindLatest). This keeps polling aggressive while records are flowing and sparse during quiet periods, which directly removes the unnecessary calls that are currently being throttled.
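There is no single switch for this in a hand-rolled consumer, so the sketch below is just one way to approximate the idea: stretch the idle time between polls while the shard is empty and caught up, and snap back to fast polling as soon as records arrive (the thresholds are illustrative, not recommendations):

```python
import time

MIN_IDLE = 0.2   # seconds between polls while data is flowing
MAX_IDLE = 1.0   # seconds between polls while the shard is idle


def poll_shard(get_records, process, shard_iterator):
    """Poll one shard, widening the idle interval whenever it is empty and caught up."""
    idle = MIN_IDLE
    while shard_iterator:
        response = get_records(shard_iterator)   # e.g. the backoff wrapper from (a)
        records = response["Records"]
        caught_up = response.get("MillisBehindLatest", 0) == 0

        if records:
            process(records)                     # your record handler
            idle = MIN_IDLE                      # data is flowing: poll quickly again
        elif caught_up:
            idle = min(idle * 2, MAX_IDLE)       # quiet shard: stretch the interval

        shard_iterator = response.get("NextShardIterator")
        time.sleep(idle)
```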

d. Consider Using Enhanced Fan-Out: If low latency is crucial for your use case, consider Enhanced Fan-Out (EFO) consumers. Each registered EFO consumer receives a dedicated 2 MB/second of read throughput per shard, pushed over SubscribeToShard, so it does not compete for the shared GetRecords call and throughput quotas.
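Registering an enhanced fan-out consumer is a one-time API call; a boto3 sketch with a placeholder stream ARN and consumer name is shown below (reads then go through SubscribeToShard, which KCL 2.x handles for you):

```python
import boto3

kinesis = boto3.client("kinesis")

# Placeholder ARN and consumer name for illustration only.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"

response = kinesis.register_stream_consumer(
    StreamARN=STREAM_ARN,
    ConsumerName="low-latency-consumer",
)

# The returned consumer ARN is what SubscribeToShard (or KCL 2.x) uses to read
# over the consumer's dedicated 2 MB/s-per-shard pipe.
print(response["Consumer"]["ConsumerARN"], response["Consumer"]["ConsumerStatus"])
```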

e. Optimize Number of Consumers: Review your consumer count (currently 32) against your shard count (96): each consumer should own roughly 3 shards. Make sure shard assignments do not overlap (multiple consumers polling the same shard share its 5 calls/second budget) and that each consumer spreads its polling across the shards it owns rather than hammering one of them.

f. Implement Proper Error Handling: Instead of a fixed 1-second wait on throttling, use an error-handling path that adapts to the observed throttling rate and backpressure, such as the backoff-with-jitter approach in (a), so the polling loop slows down while a shard is saturated and speeds back up once it recovers.

By implementing these recommendations, you should be able to reduce the throttling issues while maintaining the low latency requirements of your application. Monitor the "ReadProvisionedThroughputExceeded" metric in CloudWatch after making these changes to verify improvement. If issues persist, you may need to further fine-tune your consumer application or consider increasing the number of shards if your data volume genuinely requires it.
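One way to pull that metric for review is sketched below; the stream name is a placeholder:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Last hour of ReadProvisionedThroughputExceeded for the stream, in 5-minute buckets.
# A sustained non-zero average means consumers are still hitting the read limits.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="ReadProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Average", "Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Sum"])
```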
Sources
Handle startup, shutdown, and throttling - Amazon Kinesis Data Streams
Upstream or source throttling from a Kinesis data stream - Managed Service for Apache Flink
Troubleshoot Kinesis Data Streams issues in DynamoDB | AWS re:Post

answered 19 days ago
