How do I detect and troubleshoot ReadProvisionedThroughputExceeded exceptions in Kinesis Data Streams?
I encounter a ReadProvisionedThroughputExceeded error in Amazon Kinesis Data Streams, and I don't know why this is happening.
The ReadProvisionedThroughputExceeded error occurs when Kinesis Data Streams throttles GetRecords calls over a duration of time.
If the following quotas are exceeded, then your Amazon Kinesis data stream can be throttled:
Each shard supports up to five read transactions per second, or five GetRecords calls per second for each shard.
Each shard supports a maximum read rate of 2 MiB per second.
GetRecords retrieves up to 10 MiB of data per call from a single shard and up to 10,000 records per call. If a call to GetRecords returns 10 MiB of data, then subsequent calls that are made within the next 5 seconds result in an error.
If you encounter a ReadProvisionedThroughputExceeded error, then complete one of the following tasks:
GetRecords.Bytes: The number of bytes that are retrieved from the data stream, measured over a specified time period.
GetRecords.Records: The number of records that are retrieved from the data stream over a specified time period.
ReadProvisionedThroughputExceeded: The number of GetRecords calls that are throttled in your data stream over a specified time period.
Set up your CloudWatch dashboard to display your statistics as a Sum with the time period set to 1 minute. Then, divide Sum by 60 seconds to get an average value.
For example, if you use the GetRecords.Records metric, then divide Sum by 60 seconds to calculate the average number of records read per second. Then, check whether the average value is below the per-second limit that applies to your data stream. For more information about shard quotas, see Quotas and limits.
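The Sum-to-average conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 90 MiB reading are hypothetical, and the comparison uses GetRecords.Bytes against the 2 MiB per second per shard read quota listed earlier.

```python
MIB = 1024 * 1024
READ_BYTES_PER_SECOND_PER_SHARD = 2 * MIB  # read quota listed earlier in this article


def average_per_second(metric_sum, period_seconds=60):
    """Convert a CloudWatch Sum over one period into a per-second average."""
    return metric_sum / period_seconds


def is_below_quota(average, per_shard_quota, shard_count=1):
    """Return True if the per-second average is under the aggregate quota."""
    return average < per_shard_quota * shard_count


# Hypothetical reading: GetRecords.Bytes Sum of 90 MiB over 1 minute, one shard.
avg_bytes = average_per_second(90 * MIB)
print(avg_bytes / MIB)  # 1.5
print(is_below_quota(avg_bytes, READ_BYTES_PER_SECOND_PER_SHARD))  # True
```

An average below the quota doesn't rule out throttling, because a 1-minute average can hide short bursts within the minute.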
Note: Turn on the enhanced monitoring feature to make sure that the load is evenly distributed across all your shards.
You can also use the GetRecords.Records metric with the statistic viewed as a SampleCount and the time period set to 1 minute. Divide the SampleCount value by 60 seconds to calculate the average number of GetRecords calls made per second for each shard. If the average value is approximately five GetRecords calls per second and you get a ReadProvisionedThroughputExceeded error, then review your consumers and shard quotas. If the consumers don't exceed shard limits, then the ReadProvisionedThroughputExceeded error might be because your consumers are making more than five GetRecords calls per second.
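The SampleCount check might look like the following sketch. The helper names, the stream name, and the 10% tolerance are assumptions made for illustration. The query dictionary uses the parameter names of the CloudWatch GetMetricStatistics API and could be passed to a boto3 CloudWatch client as `get_metric_statistics(**query)`. Because GetRecords.Records is reported at the stream level, the sketch divides by the shard count and assumes that calls are spread evenly across shards.

```python
from datetime import datetime, timedelta, timezone

MAX_GET_RECORDS_CALLS_PER_SECOND = 5  # per-shard quota listed earlier


def build_sample_count_query(stream_name, end_time=None, minutes=5):
    """Build GetMetricStatistics parameters for GetRecords.Records SampleCount."""
    end_time = end_time or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.Records",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 60,  # 1-minute datapoints
        "Statistics": ["SampleCount"],
    }


def calls_per_second_per_shard(sample_count, shard_count=1, period_seconds=60):
    """Average GetRecords calls per second per shard (assumes an even spread)."""
    return sample_count / period_seconds / shard_count


def is_near_call_quota(rate, tolerance=0.1):
    """True if the rate is within `tolerance` of the quota, or above it."""
    return rate >= MAX_GET_RECORDS_CALLS_PER_SECOND * (1 - tolerance)


# Hypothetical datapoint: SampleCount of 576 in 1 minute on a two-shard stream.
rate = calls_per_second_per_shard(sample_count=576, shard_count=2)
print(rate)                      # 4.8
print(is_near_call_quota(rate))  # True
```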
Finally, check whether the ReadProvisionedThroughputExceeded value differs between your shards. If data is distributed unevenly, so that one shard receives more or less data than the others, then a distribution imbalance can occur. To resolve this imbalance and avoid hot shards, use a UUID as the partition key in the PutRecords API call.
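As a rough illustration of both steps, the sketch below flags shards whose throttle count stands out and builds PutRecords entries with random UUID partition keys. The helper names, the shard IDs, and the factor of 2 are hypothetical. The entry shape (`Data`, `PartitionKey`) matches what the PutRecords API expects in its `Records` list.

```python
import uuid


def hot_shards(throttle_counts, factor=2.0):
    """Return shard IDs whose throttle count exceeds `factor` times the mean."""
    if not throttle_counts:
        return []
    mean = sum(throttle_counts.values()) / len(throttle_counts)
    return sorted(
        shard for shard, count in throttle_counts.items()
        if mean > 0 and count > factor * mean
    )


def build_put_records_entries(payloads):
    """Pair each payload (bytes) with a random UUID partition key."""
    return [
        {"Data": payload, "PartitionKey": str(uuid.uuid4())}
        for payload in payloads
    ]


# Hypothetical per-shard ReadProvisionedThroughputExceeded sums.
print(hot_shards({"shard-0": 0, "shard-1": 1, "shard-2": 20}))  # ['shard-2']

entries = build_put_records_entries([b"event-1", b"event-2", b"event-3"])
# With boto3, these could be sent as:
# kinesis.put_records(StreamName="my-stream", Records=entries)
```

Random partition keys spread records across shards, but records that previously shared a key no longer land on the same shard, so this approach fits workloads that don't rely on per-key ordering.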
Identify a possible microburst
Although rare, a data stream can be throttled during a read even when the metric values are below the shard quotas.
For example, suppose that GetRecords.Bytes with the Sum statistic over 1 minute shows 10 MiB of data read. In the first second, GetRecords reads 2 MiB of data without any throttling. In the next second, GetRecords reads 8 MiB of data, which exceeds the shard quota of 2 MiB per second. In the third second, there might not be any read operations or any throttling. Although the total for the minute is well below the shard quota for that period (2 MiB * 60 = 120 MiB of data), you might receive a ReadProvisionedThroughputExceeded error. If you notice a sudden spike in the metric values, then look for the microburst that causes the ReadProvisionedThroughputExceeded exception.
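The example above can be reproduced with a small sketch. It assumes that you have a per-second series of bytes read, for instance from your consumer's own logs, because a 1-minute CloudWatch Sum doesn't show per-second detail. The function name is hypothetical.

```python
MIB = 1024 * 1024
PER_SECOND_QUOTA = 2 * MIB  # read quota per shard


def find_microbursts(bytes_per_second, quota=PER_SECOND_QUOTA):
    """Return (second, bytes) pairs where a single second exceeds the quota."""
    return [
        (second, read)
        for second, read in enumerate(bytes_per_second, start=1)
        if read > quota
    ]


# 2 MiB in second 1, 8 MiB in second 2, then no reads for the rest of the minute.
reads = [2 * MIB, 8 * MIB] + [0] * 58

minute_sum = sum(reads)
print(minute_sum / MIB)                    # 10.0
print(minute_sum < PER_SECOND_QUOTA * 60)  # True: the minute total looks fine
print(find_microbursts(reads))             # [(2, 8388608)]
```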
Follow Kinesis Data Streams best practices
To mitigate ReadProvisionedThroughputExceeded exceptions, follow these best practices:
Reduce the size of the GetRecords requests. Configure the limit parameter, or reduce the frequency of GetRecords requests.
Note: How you adjust the request frequency depends on the consumer:
If the consumer is Amazon Kinesis Data Firehose, then the data stream adjusts to the frequency of the GetRecords calls that are made.
If the consumer is an AWS Lambda function with event source mapping, then the stream is polled once every second. You can't modify the polling frequency.
If the consumer is an Amazon Kinesis Client Library (KCL) application, then adjust the polling frequency. To adjust the polling frequency, modify the DEFAULT_IDLETIME_BETWEEN_READS_MILLIS parameter value in the KinesisClientLibConfiguration file. You can dynamically set this value in the code. For more information about how to modify this value in the KCL, see amazon-kinesis-client on the GitHub website.
Distribute read and write operations as evenly as possible across all the shards in your data stream.
If your data stream uses more than five consumers, then use consumers with enhanced fan-out.
If you encounter ReadProvisionedThroughputExceeded exceptions, then use an error retry and exponential backoff mechanism in the consumer logic. For consumer applications that use an AWS SDK, the requests are retried by default.
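For a custom consumer, the limit parameter, a pause between reads, and retry with exponential backoff could be combined as in the following sketch. It is not the KCL's or the SDK's implementation. The function names, default delays, and attempt count are assumptions. The `client` argument is expected to behave like a boto3 Kinesis client, whose `get_records` call accepts `ShardIterator` and `Limit` and whose throttling error carries the code `ProvisionedThroughputExceededException`.

```python
import random
import time

THROTTLE_CODE = "ProvisionedThroughputExceededException"


def is_throttle_error(exc):
    """Check the error code that botocore's ClientError exposes on `response`."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == THROTTLE_CODE


def get_records_with_backoff(client, shard_iterator, limit=1000,
                             max_attempts=5, base_delay=0.2, max_delay=5.0,
                             sleep=time.sleep):
    """Call GetRecords, retrying throttled calls with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return client.get_records(ShardIterator=shard_iterator, Limit=limit)
        except Exception as exc:
            # Re-raise anything that isn't throttling, or the final failure.
            if not is_throttle_error(exc) or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # randomized ("full jitter") wait


def poll_shard(client, shard_iterator, handle, idle_seconds=1.0, limit=1000,
               sleep=time.sleep):
    """Read a shard until its iterator is exhausted, pausing between reads."""
    while shard_iterator:
        response = get_records_with_backoff(
            client, shard_iterator, limit=limit, sleep=sleep
        )
        handle(response["Records"])
        shard_iterator = response.get("NextShardIterator")
        sleep(idle_seconds)  # keeps the call rate under five per second
```

With the defaults, an idle time of 1 second keeps a single consumer at about one GetRecords call per second per shard. A smaller `limit` reduces how much data each call returns.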