To address your shard distribution challenges with Amazon Kinesis Data Streams, consider implementing the following best practices:
- Use the Kinesis Client Library (KCL): The KCL handles shard distribution and load balancing automatically. It spreads shards across worker instances, rebalances them as workers join or leave, and handles resharding for you. Node.js consumers can use it through the KCL MultiLangDaemon.
- Implement a dynamic shard assignment strategy: Instead of a static hash-based mapping from shard to poller, periodically recompute the assignment from the current list of open shards and live pollers. This keeps the load balanced as shards are split or merged.
- Leverage the new Kinesis Streams Source connector: If you're using Apache Flink, consider the new Kinesis Streams Source connector. It comes with a default UniformShardAssigner that maintains a uniform distribution of stream partitionIds across parallel subtasks, even during resharding operations.
- Monitor and adjust: Use Amazon CloudWatch metrics to monitor the stream and its individual shards. Per-shard metrics require enhanced (shard-level) monitoring to be enabled on the stream. Identify hot or cold shards and adjust your distribution strategy accordingly.
- Implement a buffer: To absorb spikes in data flow and reduce the risk of hitting CPU limits, place a bounded buffer between fetching and processing so records can wait briefly before they are processed.
- Consider scaling options: Instead of focusing solely on shard distribution, also consider scaling your application by increasing the instance size or the number of instances. Adding instances only helps up to the number of open shards, since each shard is read by one poller at a time.
- Implement graceful resharding handling: When resharding occurs, ensure your application continues to process data from parent shards until they're exhausted before moving to child shards. This preserves record order and makes the transition smoother.
- Use Auto Scaling: Implement Auto Scaling for your poller instances so the number of instances follows the load. This is particularly useful when the number of shards increases due to resharding.
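As an illustration of the dynamic assignment idea in the list above, here is a minimal TypeScript sketch. It assumes every poller can see the same list of open shard IDs and live poller IDs; `assignShards` and the poller names are hypothetical and not part of any AWS SDK. It balances shard counts only, not per-shard traffic, and it does not try to minimize how many shards move when the inputs change.

```typescript
// Hypothetical helper: none of these names come from the Kinesis SDK.
type Assignment = Map<string, string[]>;

/**
 * Spread open shards across pollers round-robin, so no poller owns more
 * than one shard above any other. Both inputs are sorted first, so every
 * poller that runs this with the same inputs computes the same result.
 */
function assignShards(shardIds: string[], pollerIds: string[]): Assignment {
  if (pollerIds.length === 0) {
    throw new Error("at least one poller is required");
  }
  const pollers = [...pollerIds].sort();
  const shards = [...shardIds].sort();

  const assignment: Assignment = new Map();
  for (const pollerId of pollers) {
    assignment.set(pollerId, []);
  }

  shards.forEach((shardId, index) => {
    const owner = pollers[index % pollers.length];
    assignment.get(owner)!.push(shardId);
  });
  return assignment;
}

// Example: five open shards (e.g. after a split) and two pollers.
const assignment = assignShards(
  [
    "shardId-000000000003",
    "shardId-000000000001",
    "shardId-000000000004",
    "shardId-000000000002",
    "shardId-000000000005",
  ],
  ["poller-b", "poller-a"],
);
```

Each poller would rerun this on a timer, or whenever the shard list changes, and start or stop reading shards to match its own entry in the result.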
By implementing these practices, you can achieve a more even distribution of shards, handle resharding scenarios more efficiently, and maintain a balanced CPU load across your Node.js pollers. Remember to test different combinations of these approaches to find the optimal solution for your specific use case.
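For the resharding point, a rough sketch of the "parents before children" rule might look like the following. The `ShardInfo` fields mirror the `ShardId`, `ParentShardId`, and `AdjacentParentShardId` values that Kinesis reports for each shard; `readableShards` and the `finished` set are hypothetical, and the sketch assumes your poller marks a shard as finished once it has read a closed shard to its end.

```typescript
// Shape mirrors the shard fields Kinesis reports; names here are illustrative.
interface ShardInfo {
  shardId: string;
  parentShardId?: string;         // set on children of a split or merge
  adjacentParentShardId?: string; // second parent, set only after a merge
}

/**
 * Return the shards that are safe to read now. A shard qualifies when it is
 * not finished itself and each parent is either finished or no longer in
 * the shard list (for example, because it has aged out of retention).
 */
function readableShards(shards: ShardInfo[], finished: Set<string>): string[] {
  const known = new Set(shards.map((s) => s.shardId));
  const parentDone = (id?: string): boolean =>
    id === undefined || !known.has(id) || finished.has(id);

  return shards
    .filter((s) => !finished.has(s.shardId))
    .filter((s) => parentDone(s.parentShardId) && parentDone(s.adjacentParentShardId))
    .map((s) => s.shardId);
}

// Example: shard-0 was split into shard-1 and shard-2,
// which were later merged into shard-3.
const demoShards: ShardInfo[] = [
  { shardId: "shard-0" },
  { shardId: "shard-1", parentShardId: "shard-0" },
  { shardId: "shard-2", parentShardId: "shard-0" },
  { shardId: "shard-3", parentShardId: "shard-1", adjacentParentShardId: "shard-2" },
];

const beforeSplitDrained = readableShards(demoShards, new Set<string>());
const afterSplitDrained = readableShards(demoShards, new Set(["shard-0"]));
const oneParentDrained = readableShards(demoShards, new Set(["shard-0", "shard-1"]));
```

In this example only `shard-0` is readable at first, then `shard-1` and `shard-2` once it is finished, and `shard-3` stays blocked until both of its parents are finished.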
Sources
Introducing the new Amazon Kinesis source connector for Apache Flink | AWS Big Data Blog
Use resharding, scaling, and parallel processing to change the number of shards - Amazon Kinesis Data Streams
Decide on a strategy for resharding - Amazon Kinesis Data Streams
Complete the resharding action - Amazon Kinesis Data Streams
Hi,
A best practice is to distribute the input records evenly across all shards, for reasons that include cost and performance.
It is well explained in this article: https://medium.com/onebyte-llc/uniform-data-distribution-among-kinesis-data-stream-shards-7d350bca4a99
Best,
Didier
