Consumer is not rebalancing when AWS MSK maintenance is going on


We are observing that whenever max.poll.interval.ms is breached on the consumer side while AWS MSK maintenance is going on, the consumer stops consuming messages from the topic. We see a log like the one below on the consumer side, after which the consumer stops consuming.

[Consumer clientId=consumer-aws.fct.entityupdate.webhooklistener.consumer-14, groupId=aws.fct.entityupdate.webhooklistener.consumer] consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
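For reference, here is a simplified sketch of the kind of poll loop we run (the bootstrap string, topic name, and record handler below are placeholders, not our real values). The relevant point is that max.poll.interval.ms is measured between successive poll() calls, so if record handling blocks for longer than that, the log above appears.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Simplified sketch of the consumer loop; names marked as placeholders are illustrative.
public class WebhookListenerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_BROKERS");   // placeholder
        props.put("group.id", "aws.fct.entityupdate.webhooklistener.consumer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "1");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("entity-update-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // If this call blocks for longer than max.poll.interval.ms,
                    // the consumer is considered failed and leaves the group.
                    handleRecord(record);
                }
            }
        }
    }

    private static void handleRecord(ConsumerRecord<String, String> record) {
        // webhook delivery / processing logic lives here
    }
}
```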

We have two consumer instances running, and once max.poll.interval.ms is breached on both (while AWS MSK maintenance is on), the consumer group stops consuming messages from Kafka. The only fix is to restart the application containing the consumers.

The problem happens only during AWS MSK maintenance. When there is no maintenance, everything works as expected: once a poll timeout happens, the consumer leaves the group, a new consumer joins the group, and that consumer keeps consuming messages.

  • Amazon MSK with Kafka 2.8.1 (Provisioned)
  • Apache Kafka Client 3.5.1
  • 2 instances of the application running on ECS Fargate, each running one consumer instance. The consumers belong to the same consumer group and consume from the same topic, which has 10 partitions.

Can someone let me know if this is expected during AWS MSK maintenance?

I do not see many logs on the broker side either.

tuk
asked 6 months ago · 314 views
1 Answer

Hello,

I would like to inform you that MSK performs broker update and patching workflows using a rolling restart process, which allows MSK to remain available and durable, meaning no downtime is required on the MSK side. However, it can affect your producers/consumers, because each broker may be unavailable for a short time and partition leadership moves from one broker to another as brokers are restarted.
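If it helps, one way to see this leadership movement from the client side is to describe the topic with the Kafka AdminClient while the maintenance window is running and watch the leader of each partition change as brokers are rolled. The snippet below is only an illustrative sketch (the bootstrap string and topic name are placeholders for your own values), not an official MSK tool.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

// Illustrative sketch: print the current leader of each partition of a topic.
public class LeaderCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_BROKERS"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Set.of("entity-update-topic")) // placeholder topic
                         .allTopicNames()
                         .get();
            topics.values().forEach(desc ->
                    desc.partitions().forEach(p -> {
                        // leader() can be null briefly while leadership is moving
                        String leader = (p.leader() == null) ? "none"
                                : String.valueOf(p.leader().id());
                        System.out.printf("partition %d -> leader broker %s%n",
                                p.partition(), leader);
                    }));
        }
    }
}
```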

Hence, the error you are observing is not expected during MSK maintenance, since in a rolling restart the other brokers remain available while one goes offline. That said, this error appears to be caused by poll latency, which is expected behavior during security patching, as the client will see connection timeout errors.

[Consumer clientId=consumer-aws.fct.entityupdate.webhooklistener.consumer-14, groupId=aws.fct.entityupdate.webhooklistener.consumer] consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

Having said that, I would recommend increasing the max.poll.interval.ms value, since this latency is what causes the configured max.poll.interval.ms to be exceeded. Please note that these settings need to be updated on the consumer side and are not configurable at the cluster level; you need to update them in the client.properties of your consumers.
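As an illustrative example only (the right value depends on how long your processing can pause while a broker restarts during patching, so these numbers are not prescriptive), the consumer-side entries in client.properties would look like this:

```properties
# Illustrative values only. Default max.poll.interval.ms is 300000 (5 minutes);
# raising it gives the poll loop more headroom while brokers are being patched.
max.poll.interval.ms=600000
# Optionally reduce the number of records returned by each poll() call (default 500).
max.poll.records=100
```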

For a deeper analysis of the issue and insights tailored to your cluster and client configurations, I request that you reach out to the AWS Premium Support team via a support case.

Please find the documentation for best practices: https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#ensure-high-availability

I hope the above information is helpful.

AWS
answered 6 months ago
  • Thanks for replying.

    Our max.poll.records is already set to 1.

    Even if max.poll.interval.ms is breached on the client side, the expectation is that the consumer gets kicked out of the group and the group rebalances so that the remaining consumer picks up its partitions, with no stalling of processing on the consumer side. We are okay with a bit of latency during the maintenance period, but what we are observing is that the consumers completely stop consuming messages from the topic and lag starts to build up on the broker side.
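    For what it is worth, below is a minimal sketch of the logging we could attach via a rebalance listener to confirm, during the next maintenance window, whether the group actually rebalances or just stalls (the topic name is a placeholder, and this assumes our existing KafkaConsumer instance):

```java
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch: log revoke/assign/lost events so the application log shows whether
// the group rebalances during MSK maintenance or simply stalls.
final class RebalanceLogging {
    static void subscribeWithLogging(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("entity-update-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Partitions revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Partitions assigned: " + partitions);
            }

            @Override
            public void onPartitionsLost(Collection<TopicPartition> partitions) {
                System.out.println("Partitions lost: " + partitions);
            }
        });
    }
}
```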

    The only workaround in this case is to restart the entire service. We have tried to simulate the scenario by doing multiple restarts of the MSK brokers on our side, but we are not able to reproduce the issue. It happens only when actual AWS maintenance is being done; it happened during this month's maintenance and also during last month's.

    Some more queries:

    1. We are using the AWS-recommended 2.8.1 broker with Kafka client 3.5.1. Does anything need to be changed here? Are you aware of any issues on the broker side with 2.8.1 which may cause this and have been fixed in more recent versions of Kafka?
    2. We have enabled broker logs, but they do not contain much information. Can you let us know how we can increase the verbosity level of the logs on the broker, so that we can see if there are errors on the broker side when we run into this issue?
    3. Is there a way for us to easily simulate the AWS MSK maintenance on our end? We already tried rebooting the brokers on
