Why does the IteratorAgeMilliseconds value in Kinesis Data Streams keep increasing?


The IteratorAgeMilliseconds metric keeps increasing in Amazon Kinesis Data Streams.

Short description

The IteratorAgeMilliseconds metric in Kinesis Data Streams can increase for the following reasons:

  • Slow record processing
  • Read throttles
  • AWS Lambda function error
  • Intermittent connection timeout
  • Uneven data distribution among shards


Slow record processing

Overloaded consumer processing logic can slow down record processing. If the consumer is built with the Amazon Kinesis Client Library (KCL), then check for the following root causes:

  • Insufficient physical resources: Check whether your instance has enough physical resources, such as memory or CPU capacity, during peak demand.
  • Failure to scale: Consumer record processing logic can fail to scale with the increased load of the Amazon Kinesis data stream. To verify scale failures, monitor the custom Amazon CloudWatch metrics that KCL emits for the processTask operation: RecordProcessor.processRecords.Time, Success, and RecordsProcessed. You can also check the overall throughput of the Kinesis data stream by monitoring the CloudWatch metrics IncomingBytes and IncomingRecords. For more information about KCL and custom CloudWatch metrics, see Monitoring the Kinesis Client Library with Amazon CloudWatch. If the processing time can't be reduced, then consider scaling up the Kinesis data stream by increasing the number of shards.
  • Overlapping processing increases: If you see an increase in the processRecords.Time value that doesn't correlate with the increased traffic load, then check the record processing logic of the consumer. The logic might be making synchronous blocking calls that delay consumer record processing. Another way to mitigate this issue is to increase the number of shards in your Kinesis data stream. For more information about the number of shards needed, see Resharding, scaling, and parallel processing.
  • Insufficient GetRecords requests: If the consumer isn't sending the GetRecords requests frequently enough, then the consumer application can fall behind. To verify, check the KCL configurations: withMaxRecords and withIdleTimeBetweenReadsInMillis.
  • Insufficient Throughput or High MillisBehindLatest: If you're using Amazon Kinesis Data Analytics for SQL, then see Insufficient throughput or High MillisBehindLatest or Consumer record processing falling behind for troubleshooting steps.
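The effect of the two KCL settings named under "Insufficient GetRecords requests" can be estimated with simple arithmetic. The following is a rough sketch, not a model of KCL's actual scheduling: it assumes that each poll cycle is one GetRecords call followed by the idle time, and it ignores processing time. The function names and the sample numbers are hypothetical.

```python
def max_fetch_rate(max_records: int, idle_ms: float, call_latency_ms: float = 0.0) -> float:
    """Upper bound on the records per second that one shard consumer can fetch.

    Assumption: one GetRecords call per cycle, and each cycle lasts
    idle_ms + call_latency_ms milliseconds.
    """
    cycle_seconds = (idle_ms + call_latency_ms) / 1000.0
    if cycle_seconds <= 0:
        raise ValueError("idle time plus call latency must be positive")
    return max_records / cycle_seconds


def falls_behind(incoming_per_second: float, max_records: int, idle_ms: float,
                 call_latency_ms: float = 0.0) -> bool:
    """True if records arrive on the shard faster than the consumer can fetch them."""
    return incoming_per_second > max_fetch_rate(max_records, idle_ms, call_latency_ms)
```

For example, with a hypothetical shard that receives 500 records per second, a consumer limited to 100 records per call and a 1,000 ms idle time can't keep up, so either setting is a candidate for tuning.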

If the consumers fall behind and there is a risk of data expiration, then increase the retention period of the stream. By default, the retention period is 24 hours and it can be configured for up to one year. For more information about data retention periods, see Changing the data retention period.
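As a minimal sketch of the retention check, the helper below estimates how long the oldest unread record has before it expires, and a second helper raises the retention period through the boto3 Kinesis client. The function names are hypothetical, and the stream name is supplied by the caller. The 24 to 8,760 hour range reflects the default and the one-year maximum described above.

```python
MS_PER_HOUR = 3_600_000


def hours_until_expiry(iterator_age_ms: float, retention_hours: int = 24) -> float:
    """Hours left before the oldest unread record ages out of the stream."""
    return retention_hours - iterator_age_ms / MS_PER_HOUR


def increase_retention(stream_name: str, retention_hours: int) -> None:
    """Raise the stream's retention period (24 hours up to 8,760 hours, which is one year)."""
    if not 24 <= retention_hours <= 8760:
        raise ValueError("retention_hours must be between 24 and 8760")
    # Imported lazily so that hours_until_expiry works without the AWS SDK installed.
    import boto3
    boto3.client("kinesis").increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=retention_hours,
    )
```

If hours_until_expiry returns a small or negative number for the current IteratorAgeMilliseconds value, then records are close to expiring (or already expiring) before the consumer reads them.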

Read throttles

Check the ReadProvisionedThroughputExceeded metric to see if there are read throttles on the stream.

Read throttles can be caused by one or more consumers exceeding the limit of five GetRecords calls per second per shard. For more information about read throttles on Kinesis streams, see How do I detect and troubleshoot ReadProvisionedThroughputExceeded exceptions in Kinesis Data Streams?
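The call limit can be checked with back-of-the-envelope arithmetic. This sketch assumes that the limit applies per shard and that every consumer application polls each shard once per idle interval. The function names are hypothetical, and real consumers might poll less regularly.

```python
GET_RECORDS_CALLS_PER_SECOND_LIMIT = 5  # per shard


def get_records_calls_per_second(consumer_count: int, idle_ms: float) -> float:
    """Combined GetRecords call rate against one shard from all consumers."""
    if idle_ms <= 0:
        raise ValueError("idle_ms must be positive")
    return consumer_count * 1000.0 / idle_ms


def exceeds_read_limit(consumer_count: int, idle_ms: float) -> bool:
    """True if the combined call rate is above the per-shard limit."""
    return get_records_calls_per_second(consumer_count, idle_ms) > GET_RECORDS_CALLS_PER_SECOND_LIMIT
```

Under these assumptions, five consumers that each poll once per second stay at the limit, while a sixth consumer, or a shorter idle time, pushes the shard over it.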

Lambda function error

In CloudWatch, review the Lambda functions that consume the stream where the IteratorAgeMilliseconds value keeps increasing. To identify the errors that cause the increase, review the Errors summary in CloudWatch. Then check whether the timestamp of the Lambda function error matches the time when the IteratorAgeMilliseconds metric of your Kinesis data stream started to increase. A match in timestamps points to the Lambda errors as the cause of the increase. Slow processing can also result from the Lambda trigger configuration (for example, a low batch size), blocked calls, or insufficient memory allocated to the Lambda function. For more information, see Configuring Lambda function options.

Note: When a Lambda function returns an error, Lambda retries the failed batch. As a Kinesis consumer, Lambda doesn't skip records, so by default it keeps retrying the same records until they're processed successfully or they expire. These retries delay the processing of the records that follow. Your consumer then falls behind the stream, which causes the IteratorAgeMilliseconds metric to increase.
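The timestamp comparison can be sketched as follows. This is one possible heuristic, not an AWS-provided check: it treats the start of the final uninterrupted rise in iterator age as the time of the increase, and it counts an error as a match when it falls within a tolerance window. The function names and the five-minute default window are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Datapoint = Tuple[datetime, float]  # (timestamp, iterator age in milliseconds)


def increase_start(points: Sequence[Datapoint]) -> Optional[datetime]:
    """Timestamp where the final uninterrupted rise in iterator age begins.

    Expects points sorted by timestamp. Returns None if the series isn't rising at its end.
    """
    if len(points) < 2 or points[-1][1] <= points[-2][1]:
        return None
    i = len(points) - 1
    # Walk backwards while each datapoint is higher than the one before it.
    while i > 0 and points[i][1] > points[i - 1][1]:
        i -= 1
    return points[i][0]


def errors_match_increase(error_times: Sequence[datetime],
                          points: Sequence[Datapoint],
                          window: timedelta = timedelta(minutes=5)) -> bool:
    """True if any Lambda error timestamp lies within `window` of the start of the rise."""
    start = increase_start(points)
    if start is None:
        return False
    return any(abs(t - start) <= window for t in error_times)
```

The datapoints and error timestamps would come from the CloudWatch metrics for the stream and the function. A True result suggests a correlation and is a starting point for investigation, not proof of the cause.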

Intermittent connection timeout

Your consumer application can experience a connection timeout when it pulls records from the Kinesis data stream. Intermittent connection timeout errors can cause a significant increase in the IteratorAgeMilliseconds value.

To verify whether the increase is related to a connection timeout, check the GetRecords.Latency and GetRecords.Success metrics. If both metrics are affected at the same time, then a connection timeout is the likely cause, and the IteratorAgeMilliseconds value stops increasing after the connection is restored.
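One way to pull these metrics side by side is through the CloudWatch GetMetricStatistics API. The sketch below builds the query parameters for a stream-level metric in the AWS/Kinesis namespace and fetches the datapoints with boto3. The helper names, the one-hour look-back, and the 60-second period are assumptions to adjust for your case. In CloudWatch, the stream-level iterator age is published as GetRecords.IteratorAgeMilliseconds.

```python
from datetime import datetime, timedelta, timezone


def kinesis_metric_query(stream_name: str, metric_name: str, statistic: str,
                         minutes: int = 60, period_seconds: int = 60) -> dict:
    """Build GetMetricStatistics parameters for one stream-level Kinesis metric."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def fetch_datapoints(query: dict) -> list:
    """Run the query and return the datapoints sorted by timestamp."""
    # Imported lazily so that the query builder works without the AWS SDK installed.
    import boto3
    response = boto3.client("cloudwatch").get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

For example, querying GetRecords.Latency and GetRecords.Success with the Average statistic, and GetRecords.IteratorAgeMilliseconds with the Maximum statistic, over the same time range makes it easier to see whether the three metrics change together.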

Uneven data distribution among shards

Some shards in your Kinesis data stream might receive more records than others. This happens when the partition key used in Put operations doesn't distribute the data equally across the shards. This uneven data distribution results in fewer parallel GetRecords calls to the Kinesis data stream shards, which causes an increase in the IteratorAgeMilliseconds value.

You can use random partition keys to distribute data evenly over the shards of the stream. Random partition keys can help the consumer application to read records faster.
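The effect of the partition key can be illustrated with a small local simulation. Kinesis Data Streams maps each partition key to a 128-bit integer with an MD5 hash and routes the record to the shard whose hash key range contains that value. The sketch below assumes that the shards split the hash space evenly, which isn't guaranteed after resharding. The function names and the sample key are hypothetical.

```python
import hashlib
import uuid
from collections import Counter
from typing import Iterable


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard that would receive this key, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    return hash_value * shard_count // 2**128


def distribution(keys: Iterable[str], shard_count: int) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# One repeated key sends every record to the same shard.
hot = distribution(["customer-42"] * 1000, shard_count=4)
# Random keys (UUIDs here) spread records across all shards.
spread = distribution([str(uuid.uuid4()) for _ in range(1000)], shard_count=4)
```

In this simulation, hot contains a single shard index with all 1,000 records, while spread contains roughly 250 records for each of the four shards.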

Related information

Lambda event source mappings

Troubleshooting Kinesis Data Streams consumers

Using AWS Lambda with Amazon Kinesis

AWS OFFICIAL, updated 4 months ago