MSK endpoints inaccessible during scheduled security patching activity

In our AWS environment, one of the MSK clusters was patched 2 days ago. It was a scheduled change from AWS to apply OS updates. The notification says:

MSK uses an automated rolling update to patch one broker at a time, following Kafka best practices. To ensure client I/O continuity during the rolling update that is performed as part of the patching process, we recommend you review the configuration of your clients and your Apache Kafka topics as follows:

  1. Ensure the topic replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during patching.
  2. Set minimum in-sync replicas (minISR) to at most RF - 1 to ensure the partition replica set can tolerate one replica being offline or under-replicated.
  3. Ensure clients are configured to use multiple broker connection strings. Having multiple brokers in a client’s connection string allows for failover if a specific broker supporting client I/O begins to be patched. For information about how to get a connection string with multiple brokers, see Getting the Bootstrap Brokers for an Amazon MSK Cluster [1].
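
For anyone who wants to double-check the three recommendations against their own settings, here is a minimal sketch in Python. It only encodes the rules quoted above; the `TopicSettings` class, the `check_patching_readiness` function and the example broker host names are hypothetical names made up for illustration, and the topic settings are assumed to have been collected by hand or from your own tooling rather than read from the cluster.

```python
from dataclasses import dataclass


@dataclass
class TopicSettings:
    # Hypothetical container; fill it in from your own topic descriptions.
    name: str
    replication_factor: int
    min_insync_replicas: int


def check_patching_readiness(topics, az_count, bootstrap_servers):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    for t in topics:
        # Recommendation 1: RF >= 2 for two-AZ clusters, >= 3 for three-AZ clusters.
        if t.replication_factor < az_count:
            problems.append(f"{t.name}: RF {t.replication_factor} is below {az_count}")
        # Recommendation 2: minISR must be at most RF - 1.
        if t.min_insync_replicas > t.replication_factor - 1:
            problems.append(f"{t.name}: minISR {t.min_insync_replicas} exceeds RF - 1")
    # Recommendation 3: the client connection string should list several brokers.
    brokers = [b.strip() for b in bootstrap_servers.split(",") if b.strip()]
    if len(brokers) < 2:
        problems.append("bootstrap string lists fewer than 2 brokers")
    return problems


if __name__ == "__main__":
    topics = [TopicSettings("orders", 3, 2), TopicSettings("audit", 3, 3)]
    # Host names below are placeholders, not real MSK endpoints.
    for problem in check_patching_readiness(
        topics, 3, "b-1.example:9092,b-2.example:9092"
    ):
        print(problem)
```

Passing these checks does not rule out other causes of an outage; it only confirms the settings match the notification.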

Our MSK cluster complies with all three recommendations; however, we observed that the MSK endpoint was not accessible for a period of 1 hour.

Is it expected that both endpoints may go down during a rolling update? Also, we currently do not have a support plan, and AWS still does not provide a way to reschedule the updates to non-production hours from the dashboard.

Vipul
asked 2 months ago, 307 views
1 Answer
Hello there,

Our MSK cluster complies with all three recommendations; however, we observed that the MSK endpoint was not accessible for a period of 1 hour.

This is unexpected behaviour and needs further investigation. Please check the ActiveControllerCount metric to see how long the broker was down. Also check the PartitionCount metric to see which broker went down and for how long. You can also check the broker logs, if enabled. If needed, please raise a support case so that this can be investigated further and avoided in future.
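
As a rough illustration of the metric check, the sketch below finds the time windows in which a metric deviated from its expected value. It assumes you have already exported the datapoints from CloudWatch as (timestamp, value) pairs, and that the cluster-wide sum of ActiveControllerCount should be 1 when the cluster is healthy. The function name `unhealthy_windows` and the sample data are hypothetical, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def unhealthy_windows(samples, expected=1):
    """Group consecutive samples whose value differs from `expected`.

    `samples` is an iterable of (timestamp, value) pairs.
    Returns a list of (first_bad_timestamp, last_bad_timestamp) tuples.
    """
    windows = []
    start = last = None
    for ts, value in sorted(samples, key=lambda s: s[0]):
        if value != expected:
            if start is None:
                start = ts  # a new unhealthy window begins here
            last = ts
        elif start is not None:
            windows.append((start, last))  # window closed by a healthy sample
            start = last = None
    if start is not None:
        windows.append((start, last))  # series ended while still unhealthy
    return windows


if __name__ == "__main__":
    # Made-up one-minute samples, for illustration only.
    t0 = datetime(2024, 1, 1, 2, 0)
    values = [1, 1, 0, 0, 0, 1, 1]
    samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]
    for first, last in unhealthy_windows(samples):
        print(f"deviation from {first} to {last}")
```

The same function can be run per broker on PartitionCount by passing the broker's normal partition count as `expected`. Note that it only sees datapoints that exist, so minutes with no datapoint at all have to be looked for separately.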

Is it expected that both endpoints may go down during rolling update?

No. During any update or rolling restart, only one broker goes down at a time.

Also, we currently do not have support plan, AWS still does not provide a way from dashboard to re-schedule the updates during non-production hours.

This can't be done from your end. You need to reach out to our Support team to request that your maintenance window be rescheduled to non-production hours. You can also ask us to schedule it for a particular time and day of every month.

Example: every first Sunday at 02:00 UTC

AWS
SUPPORT ENGINEER
answered 2 months ago
