Kinesis data stream iterator age spikes


Hi,

I am working on a system that uses Kinesis to ingest data; the data is then processed by a Lambda function and a Kinesis delivery stream. We have CloudWatch alarms set up that trigger if the iterator age for a stream goes above 10 seconds, since that would mean the Lambda has crashed and can no longer process data, causing data to build up on the stream.
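For illustration, a minimal CDK (TypeScript) sketch of an alarm like that could look as follows. The construct names are placeholders rather than our real resources, and the threshold is 10,000 because the stream-level metric is reported in milliseconds:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

declare const scope: cdk.Stack;       // the surrounding stack
declare const stream: kinesis.Stream; // the ingest stream

// Alarm when the stream-level iterator age exceeds 10 seconds (10,000 ms).
new cloudwatch.Alarm(scope, 'IteratorAgeAlarm', {
  metric: stream.metricGetRecordsIteratorAgeMilliseconds({
    statistic: 'Maximum',
    period: cdk.Duration.minutes(1),
  }),
  threshold: 10_000,
  evaluationPeriods: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
```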

Over the last couple of weeks, we have seen an increase in alarm triggers. The cause is that the iterator age spikes to an extremely high value and then drops again within 1 to 2 minutes. There does not appear to be any pattern to the spikes: sometimes there are multiple spikes a day, sometimes none for a couple of days. Some articles and posts suggest this could be caused by a backlog of data or by hitting a limit of the stream. However, when these spikes happen, there is usually no data being added to the stream. There have even been instances where only the stream was configured, with no system putting any data on it, and the spikes still occurred.

This is what the spikes look like over a 3-day period: [screenshot of the stream's iterator age metric over 3 days]

I have already checked the following possible causes, and none of them is the issue:

  • Lambda stuck processing
  • Lambda error (there is also an alarm for this)
  • A lot of data added in a short time

The stream is configured with 1 shard, enhanced fan-out is disabled, server-side encryption is enabled, data retention is set to 1 day, and there are only 2 consumers. The Kinesis delivery stream has record format conversion enabled, converting data from JSON to Parquet using the standard AWS features; there is no custom Lambda. The delivery stream reports no failures. The Lambda event source has starting position "LATEST" and a maximum record age of "-1".
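For reference, the consumer side is wired up roughly like the simplified CDK (TypeScript) sketch below, assuming the consumer function already exists as processor; the batch size is just an example value:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { KinesisEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

declare const scope: cdk.Stack;           // the surrounding stack
declare const processor: lambda.Function; // the consumer Lambda

// Single-shard stream with server-side encryption and 1-day retention.
const stream = new kinesis.Stream(scope, 'IngestStream', {
  shardCount: 1,
  retentionPeriod: cdk.Duration.days(1),
  encryption: kinesis.StreamEncryption.MANAGED,
});

// Event source mapping starting at LATEST. Leaving maxRecordAge unset
// corresponds to the "-1" (no limit) value shown in the Lambda console.
processor.addEventSource(new KinesisEventSource(stream, {
  startingPosition: lambda.StartingPosition.LATEST,
  batchSize: 100, // illustrative value
}));
```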

What could be the cause of these spikes? Or how could I investigate this and figure out a solution?

1 Answer

I would recommend checking the Lambda metrics for the same time frame to see whether the number of invocations or the duration changed. If you do not see any difference compared to other times, I recommend creating a support case.
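For example, something along these lines with the AWS SDK for JavaScript (v3) pulls both metrics at 1-minute resolution so they can be lined up against the iterator age spikes; the function name is a placeholder for your consumer Lambda:

```typescript
import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const FUNCTION_NAME = 'stream-processor'; // placeholder: the consumer Lambda

async function fetchLambdaMetrics(start: Date, end: Date) {
  const client = new CloudWatchClient({});
  const lambdaMetric = (id: string, metricName: string, stat: string) => ({
    Id: id,
    MetricStat: {
      Metric: {
        Namespace: 'AWS/Lambda',
        MetricName: metricName,
        Dimensions: [{ Name: 'FunctionName', Value: FUNCTION_NAME }],
      },
      Period: 60, // 1-minute resolution, to line up with the spikes
      Stat: stat,
    },
  });

  const { MetricDataResults } = await client.send(new GetMetricDataCommand({
    StartTime: start,
    EndTime: end,
    MetricDataQueries: [
      lambdaMetric('invocations', 'Invocations', 'Sum'),
      lambdaMetric('duration', 'Duration', 'Maximum'),
    ],
  }));
  return MetricDataResults;
}

// Example: look at the last 3 days.
fetchLambdaMetrics(new Date(Date.now() - 3 * 24 * 3600 * 1000), new Date())
  .then((results) => console.log(JSON.stringify(results, null, 2)));
```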

AWS
EXPERT
Uri
answered 2 years ago
  • The spikes in iterator age almost always occur when there are no Lambda invocations, and it seems to happen more often when there is less data on the stream, but that could just be a coincidence. I have even seen this happen multiple times when there was no data on the stream for days.

    We are not on an AWS support plan, so we cannot create a support case. I guess we'll just have to keep digging to figure out what is going on here.

  • This is really strange. The iterator age represents how far behind you are in processing the events on the stream. Could it be that you have more than one consumer? If that is the case, and the second consumer doesn't read messages constantly, then when it does read them, the iterator age will be high.

    Which metric are you looking at: GetRecords.IteratorAgeMilliseconds or the Lambda IteratorAge? If it is the first one, I would expect that you have another consumer. If the second one, I have no clue.

  • There are 2 consumers: an AWS Firehose delivery stream that converts records from JSON to Parquet using the standard configuration (so no Lambda), and a Lambda that processes the data on the stream. The entire application is configured using CDK, so besides these 2, there are no other consumers of this stream.

    I am looking at the GetRecords.IteratorAgeMilliseconds metric of the Kinesis stream. The iterator age of the consumer Lambda always hovers around 800 to 1800 milliseconds. What is a bit odd is another metric I found while investigating this, the "KinesisMillisBehindLatest" metric of the delivery stream. This metric is almost identical to "GetRecords.IteratorAgeMilliseconds", but not always: sometimes the Kinesis stream has an iterator age of 11M while the "KinesisMillisBehindLatest" metric of the delivery stream is still at 0.

    I did manage to find this GitHub issue that describes a similar problem: https://github.com/awslabs/amazon-kinesis-client/issues/185. But in my case I am not using the Amazon Kinesis Client Library; I am only using a Lambda that is fed directly by the stream, and a delivery stream that converts records using Glue.

  • I do not know what is causing the issue. However, I think you are only interested in the iterator age of the Lambda function, so I would change my alarm to that metric.

  • That seems like it would probably be better in this case. I'll move the alarm to that metric, along the lines of the sketch below. I'd rather have a proper solution for those stream spikes, but I'll settle for this. Thanks for your help.
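A rough CDK (TypeScript) sketch of that change, assuming the consumer function is available as processor (the Lambda IteratorAge metric is also reported in milliseconds):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

declare const scope: cdk.Stack;           // the surrounding stack
declare const processor: lambda.Function; // the stream consumer Lambda

// Alarm on the function's own IteratorAge instead of the stream-level
// GetRecords.IteratorAgeMilliseconds metric.
new cloudwatch.Alarm(scope, 'LambdaIteratorAgeAlarm', {
  metric: processor.metric('IteratorAge', {
    statistic: 'Maximum',
    period: cdk.Duration.minutes(1),
  }),
  threshold: 10_000, // 10 seconds, in milliseconds
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
```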
