I want to maintain high availability in my Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters during security patches.
Resolution
Amazon MSK uses rolling updates to maintain high availability and support cluster I/O during patching. To prevent downtime during security patching, configure your cluster with the following high availability settings.
Set up a three-Availability Zone cluster
To prevent downtime when an Availability Zone fails, set up a three-Availability Zone cluster.
Note: By default, Express brokers distribute your data across three Availability Zones.
To achieve rack-aware replica assignment for fault tolerance, Amazon MSK sets the broker.rack broker property at the Availability Zone level. When you use a three-Availability Zone cluster with a replication factor (RF) of 3, each of the three partition replicas is in a separate Availability Zone.
Note: A two-Availability Zone cluster with an RF of 3 can't place each of the three partition replicas in a separate Availability Zone, so two of the replicas share an Availability Zone. Amazon MSK doesn't allow you to create a cluster in a single Availability Zone.
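The following Python sketch illustrates why the Availability Zone count matters for replica placement. It is a simplified rotation across zones, not Apache Kafka's actual rack-aware assignment algorithm. The broker IDs, the zone names, and the place_replicas and distinct_azs helpers are hypothetical.

```python
def place_replicas(brokers_by_az: dict[str, list[int]], rf: int) -> list[tuple[int, str]]:
    """Pick rf brokers for one partition, visiting every AZ before reusing one."""
    total_brokers = sum(len(brokers) for brokers in brokers_by_az.values())
    if rf > total_brokers:
        raise ValueError("replication factor exceeds broker count")
    placement = []
    depth = 0  # index of the broker to take from each AZ in this pass
    while len(placement) < rf:
        for az in sorted(brokers_by_az):
            brokers = brokers_by_az[az]
            if depth < len(brokers) and len(placement) < rf:
                placement.append((brokers[depth], az))
        depth += 1
    return placement


def distinct_azs(placement: list[tuple[int, str]]) -> int:
    """Count how many different AZs hold a replica."""
    return len({az for _, az in placement})


# Hypothetical clusters: broker IDs grouped by Availability Zone.
three_az = {"az-a": [1], "az-b": [2], "az-c": [3]}
two_az = {"az-a": [1, 3], "az-b": [2, 4]}

print(distinct_azs(place_replicas(three_az, rf=3)))  # 3: one replica per AZ
print(distinct_azs(place_replicas(two_az, rf=3)))    # 2: two replicas share an AZ
```

In the two-zone case, the loss of the zone that holds two replicas leaves the partition with a single copy.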
Set the replication factor equal to the Availability Zone count
When Amazon MSK restarts a broker during a security patch, the partition leaders on that broker become unavailable. As a result, Apache Kafka elects one of the follower replicas of each affected partition as the new leader so that it can continue to service clients.
An RF of 1 might lead to offline partitions during a rolling update because the cluster doesn't have any replicas to promote as a new leader. An RF of 2, with a minimum in-sync replicas (minISR) value of 1, might result in data loss, even when producer acknowledgement (acks) is set to all. For a write to be successful, a minISR of 1 requires only the leader to acknowledge the write. If the leader replica's broker goes down immediately after the acknowledgement but before the follower replica catches up, then data loss occurs. For more information, see min.insync.replicas on the Apache Kafka website.
Note: If you use Express brokers, then by default RF is set to 3.
Set minISR to RF - 1 or lower
If you set the minISR to the value of the RF, then you might experience producer failures when one broker is out of service. If the required number of replicas don't acknowledge the write, then the producer raises an exception. For example, if the Availability Zone count is 3 and the RF is 3, then the producer waits for all three partition replicas, including the leader, to acknowledge the messages. When one of the brokers is out of service, only two of the three partition replicas return acknowledgements, and that results in a producer exception.
Note: If you use Express brokers, then by default the minISR is set to 2.
When you set producer acks to all, the record isn't lost as long as at least one in-sync replica remains alive. For more information, see acks on the Apache Kafka website.
Note: minISR has topic-level settings and broker-level settings. Topic-level settings always override broker-level settings.
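The interaction between RF, minISR, and a broker restart can be sketched as a small model. This is only an illustration: it assumes that every out-of-service broker hosts one replica of the partition and that all other replicas stay in sync. The write_accepted and copies_at_ack functions are hypothetical helpers, not part of any Kafka client.

```python
def write_accepted(rf: int, min_isr: int, brokers_down: int) -> bool:
    """With acks=all, a write succeeds only while the ISR size is at least minISR."""
    in_sync = max(rf - brokers_down, 0)  # replicas still in sync (model assumption)
    return in_sync >= min_isr


def copies_at_ack(min_isr: int) -> int:
    """Minimum number of replicas that hold a record when acks=all returns."""
    return min_isr


# One broker is out of service for patching, RF is 3.
for min_isr in (3, 2, 1):
    print(f"minISR={min_isr} accepted={write_accepted(3, min_isr, brokers_down=1)}")
# minISR=3 accepted=False
# minISR=2 accepted=True
# minISR=1 accepted=True

# With minISR=1 an acknowledged record might exist only on the leader.
print(copies_at_ack(1))  # 1
```

In this model, an RF of 3 with a minISR of 2 is the only combination that both keeps accepting writes during a single broker restart and guarantees more than one copy of each acknowledged record.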
Include at least one broker from each Availability Zone in the client connection string
The client uses a single broker's endpoint to bootstrap a connection to the cluster. During the initial connection, the broker sends metadata with information about the brokers that the client must access.
If only one broker is in the client's connection string and that broker becomes unavailable during patching, then the client fails to establish an initial connection. However, if you have multiple brokers in the connection string, then the client can fail over to another broker when the broker that it used to establish the connection goes offline. For more information, see Get the bootstrap brokers for an Amazon MSK cluster.
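As a minimal sketch, the following Python helper builds a connection string that contains one broker from each Availability Zone. The hostnames, zone names, and the bootstrap_string function are hypothetical. The port is passed in because it depends on the authentication method that your cluster uses. The value 9094 here is only an example.

```python
def bootstrap_string(brokers: list[tuple[str, str]], port: int) -> str:
    """Return a comma-separated host:port list with the first broker seen in each AZ."""
    first_per_az: dict[str, str] = {}
    for host, az in brokers:
        first_per_az.setdefault(az, host)  # keep only the first broker per AZ
    return ",".join(f"{host}:{port}" for host in first_per_az.values())


# Hypothetical broker endpoints and their Availability Zones.
brokers = [
    ("b-1.example-cluster.internal", "az-a"),
    ("b-2.example-cluster.internal", "az-b"),
    ("b-3.example-cluster.internal", "az-c"),
    ("b-4.example-cluster.internal", "az-a"),
]

print(bootstrap_string(brokers, port=9094))
```

Pass the resulting string as the bootstrap.servers value of your client.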
Allow retries
When Amazon MSK reboots a broker during patching, leader partitions on that broker become unavailable. As a result, Apache Kafka promotes replica partitions on another broker as new leaders. The client requests a metadata update to locate the new leader partitions. During this transition, even with proper configuration, your client might experience one of the following transient errors:
- "NETWORK_EXCEPTION: Disconnected from node"
- "NotLeaderOrFollowerException: Broker is not the current leader"
Kafka clients automatically resolve these errors within 1-3 seconds. By default, clients have retries built in to handle these transient errors. Confirm that you configured your client for retries. For more information, see retries on the Apache Kafka website.
Note: The metadata.max.age.ms setting controls how frequently clients refresh metadata. By default, it's set to 300000 ms (5 minutes). If you want clients to discover new leaders faster during failover, then lower the metadata.max.age.ms value. For more information, see metadata.max.age.ms on the Apache Kafka website.
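The retry-related settings can be collected in one place, as in the following sketch. The property names are standard Apache Kafka producer configurations. The values are illustrative assumptions rather than recommendations, and the 60000 ms metadata age is only an example of a lowered value. The to_properties helper is hypothetical and renders the settings in properties-file form.

```python
# Illustrative producer settings; tune the values for your own workload.
producer_settings = {
    "acks": "all",                  # wait for all in-sync replicas
    "retries": 2147483647,          # keep retrying transient errors
    "retry.backoff.ms": 100,        # pause between retry attempts
    "delivery.timeout.ms": 120000,  # upper bound on total send time, including retries
    "metadata.max.age.ms": 60000,   # example value below the 300000 ms default
}


def to_properties(settings: dict[str, object]) -> str:
    """Render the settings as key=value lines for a client properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(to_properties(producer_settings))
```

Because delivery.timeout.ms bounds the total time for a send, retries stop when that timeout expires even if the retries value is very large.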
Monitor key metrics during patching operations
Use Amazon CloudWatch to monitor the following metrics during patching operations:
- UnderMinIsrPartitionCount
- UnderReplicatedPartitions
- LeaderCount
- CPUUser + CPUSystem (CPU usage)
UnderMinIsrPartitionCount must always be 0.
UnderReplicatedPartitions might temporarily increase during patching. However, it returns to 0 after patching.
During patching, partition leadership must redistribute across the remaining online brokers, so LeaderCount on those brokers increases.
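The expectations above can be expressed as a simple check over metric samples. The following sketch doesn't call CloudWatch. It assumes that you already retrieved the datapoints and ordered them from oldest to newest. The patching_findings function and the sample values are hypothetical.

```python
def patching_findings(samples: dict[str, list[float]]) -> list[str]:
    """Return a list of problems found in metric samples ordered oldest to newest."""
    findings = []
    # UnderMinIsrPartitionCount must stay at 0 for the whole window.
    if any(value > 0 for value in samples.get("UnderMinIsrPartitionCount", [])):
        findings.append("UnderMinIsrPartitionCount rose above 0")
    # UnderReplicatedPartitions may spike, but the latest value must be 0.
    under_replicated = samples.get("UnderReplicatedPartitions", [])
    if under_replicated and under_replicated[-1] > 0:
        findings.append("UnderReplicatedPartitions has not returned to 0")
    return findings


# Made-up datapoints for one patching window.
samples = {
    "UnderMinIsrPartitionCount": [0, 0, 0, 0],
    "UnderReplicatedPartitions": [0, 12, 5, 0],
}

print(patching_findings(samples))  # []
```

An empty list means that the samples match the expected pattern: no partitions under minISR, and a temporary rise in under-replicated partitions that clears after patching.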
Related information
Patching on MSK Provisioned clusters