MSK endpoints inaccessible during scheduled security patching activity

In our AWS environment, one of the MSK clusters was patched 2 days ago. It was a scheduled change from AWS to apply OS updates. The notification says:

MSK uses an automated rolling update to patch one broker at a time, following Kafka best practices. To ensure client I/O continuity during the rolling update that is performed as part of the patching process, we recommend you review the configuration of your clients and your Apache Kafka topics as follows:

  1. Ensure the topic replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during patching.
  2. Set minimum in-sync replicas (minISR) to at most RF - 1 to ensure the partition replica set can tolerate one replica being offline or under-replicated.
  3. Ensure clients are configured to use multiple broker connection strings. Having multiple brokers in a client's connection string allows for failover if a specific broker supporting client I/O begins to be patched. For information about how to get a connection string with multiple brokers, see Getting the Bootstrap Brokers for an Amazon MSK Cluster [1].
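The three recommendations above can be expressed as a small pre-patching check. This is only a minimal sketch: the function name `check_patching_readiness`, its parameters, and the example broker addresses are hypothetical, and the values are assumed to have been read from the topic configuration and client settings by hand.

```python
def check_patching_readiness(replication_factor, min_insync_replicas,
                             az_count, bootstrap_servers):
    """Return a list of problems found against the three recommendations.

    All arguments are supplied by the caller; nothing is read from the cluster.
    `bootstrap_servers` is the comma-separated string given to the client.
    """
    problems = []

    # 1. RF of at least 2 for two-AZ clusters, at least 3 for three-AZ clusters.
    required_rf = 3 if az_count >= 3 else 2
    if replication_factor < required_rf:
        problems.append(
            f"RF is {replication_factor}, expected at least {required_rf}"
        )

    # 2. minISR of at most RF - 1, so one replica can be offline.
    if min_insync_replicas > replication_factor - 1:
        problems.append(
            f"minISR is {min_insync_replicas}, expected at most "
            f"{replication_factor - 1}"
        )

    # 3. More than one broker in the client's connection string.
    brokers = [b.strip() for b in bootstrap_servers.split(",") if b.strip()]
    if len(brokers) < 2:
        problems.append("connection string lists fewer than two brokers")

    return problems


if __name__ == "__main__":
    # Hypothetical values for a three-AZ cluster.
    issues = check_patching_readiness(
        replication_factor=3,
        min_insync_replicas=2,
        az_count=3,
        bootstrap_servers="b-1.example.internal:9092,b-2.example.internal:9092",
    )
    print(issues or "all three recommendations are met")
```

An empty list means the settings passed in satisfy all three points; it says nothing about whether the cluster itself behaved correctly during the update.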

Our MSK cluster complies with all three recommendations; however, we observed that the MSK endpoint was not accessible for a period of 1 hour.

Is it expected that both endpoints may go down during a rolling update? Also, we currently do not have a support plan, and AWS still does not provide a way to reschedule the updates to non-production hours from the dashboard.

Vipul
Asked 3 months ago · 314 views
1 Answer

Hello there,

Our MSK cluster complies with all three recommendations; however, we observed that the MSK endpoint was not accessible for a period of 1 hour.

This is unexpected behaviour and needs to be investigated further. Please check the ActiveControllerCount metric to see how long the broker was down, and the PartitionCount metric to see which broker went down and for how long. You can also check the broker logs, if enabled. If needed, please raise a support case to investigate further and to avoid this issue in the future.
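As a rough illustration of the metric check, the sketch below pulls ActiveControllerCount from CloudWatch and groups the minutes where it deviates from the expected value. It rests on several assumptions: `boto3` is installed with credentials configured, the metric is read from the `AWS/Kafka` namespace with the `Cluster Name` dimension, and a healthy cluster reports a value of 1. The helper names `fetch_active_controller_count` and `unhealthy_windows` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fetch_active_controller_count(cluster_name, start, end, region):
    """Fetch per-minute ActiveControllerCount datapoints (assumed setup)."""
    import boto3  # assumption: boto3 is installed and credentials are set up

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName="ActiveControllerCount",
        Dimensions=[{"Name": "Cluster Name", "Value": cluster_name}],
        StartTime=start,
        EndTime=end,
        Period=60,  # one datapoint per minute; keep the window under 24 hours
        Statistics=["Maximum"],
    )
    return [(p["Timestamp"], p["Maximum"]) for p in response["Datapoints"]]


def unhealthy_windows(datapoints, expected=1.0):
    """Group consecutive datapoints whose value differs from `expected`.

    `datapoints` is a list of (timestamp, value) tuples in any order.
    Returns a list of (first_bad_timestamp, last_bad_timestamp) pairs.
    """
    windows = []
    start = last = None
    for timestamp, value in sorted(datapoints):
        if value != expected:
            if start is None:
                start = timestamp
            last = timestamp
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows


if __name__ == "__main__":
    # Synthetic datapoints standing in for a CloudWatch response.
    t0 = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
    sample = [(t0 + timedelta(minutes=i), v)
              for i, v in enumerate([1.0, 0.0, 0.0, 1.0, 1.0])]
    for begin, finish in unhealthy_windows(sample):
        print(f"controller count off from {begin} to {finish}")
```

The same grouping could be applied to PartitionCount per broker by adding a `Broker ID` dimension to the query, again as an assumption about how the metric is exposed. Minutes with no datapoint at all are not detected by this sketch and would need a separate gap check.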

Is it expected that both endpoints may go down during a rolling update?

No. During any update or rolling restart, only one broker goes down at a time.

Also, we currently do not have a support plan, and AWS still does not provide a way to reschedule the updates to non-production hours from the dashboard.

This can't be done from your end. You need to contact our Support team to request that your maintenance window be rescheduled to non-production hours. You can also ask us to reschedule it to a particular time and day of every month.

Example: every first Sunday at 02:00 UTC

AWS
Support Engineer
Answered 2 months ago
