Hi there. Security patching triggers a rolling reboot of the brokers. During this time, partition leadership moves from one broker to another as each broker is restarted. Clients may see connection errors such as "Connection refused", timeouts, or a message that the partition leader has changed, but they then request fresh metadata to find the current leader and automatically retry the operation against the other available brokers. This can show up as client-side latency, but it does not affect client functionality and will not cause data loss as long as the best practices below are followed:
- Ensure the topic replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during patching.
- Set minimum in-sync replicas (minISR) to at most RF - 1 so that the partition replica set can tolerate one replica being offline or under-replicated.
- Ensure clients are configured to use multiple broker connection strings. Having multiple brokers in a client’s connection string allows for failover if a specific broker supporting client I/O begins to be patched.
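As a rough illustration of the three checks above, here is a minimal Python sketch. The function names and the warning texts are hypothetical and not part of any MSK or Kafka API; the sketch only encodes the rules as stated in this answer and assumes you already know your AZ count, RF, minISR, and client connection string.

```python
def check_topic_settings(az_count, replication_factor, min_isr):
    """Return a list of warnings; an empty list means the checks pass.

    Hypothetical helper: encodes the RF and minISR rules from this answer.
    """
    warnings = []
    # RF >= 2 for two-AZ clusters, RF >= 3 for three-AZ clusters.
    required_rf = 3 if az_count >= 3 else 2
    if replication_factor < required_rf:
        warnings.append(
            f"RF={replication_factor} is below {required_rf} for {az_count} AZs"
        )
    # minISR must be at most RF - 1 so one replica can be offline.
    if min_isr > replication_factor - 1:
        warnings.append(
            f"minISR={min_isr} exceeds RF - 1 = {replication_factor - 1}"
        )
    return warnings


def has_multiple_brokers(connection_string):
    """True if the comma-separated connection string lists at least two brokers."""
    brokers = [b.strip() for b in connection_string.split(",") if b.strip()]
    return len(brokers) >= 2
```

For example, `check_topic_settings(3, 3, 2)` returns an empty list, while a topic with RF=1 on a three-AZ cluster produces a warning for each rule.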
Since you have three AZs, please use RF=3 and minISR=2 (your configuration already has the right values). On the producer side, please make sure enough retries are configured. Because it can take a few milliseconds for leadership to transfer to another broker, you can set retry.backoff.ms to 50-100 ms so that the producer waits briefly before retrying.
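A minimal sketch of what such a producer configuration might look like, written as a plain Python dictionary using the standard Kafka producer property names (`bootstrap.servers`, `retries`, `retry.backoff.ms`). The broker hostnames are placeholders, the `build_producer_config` helper is hypothetical, and the retry count of 10 is only an illustrative assumption, since the right value depends on your workload.

```python
def build_producer_config(brokers, retries=10, retry_backoff_ms=100):
    """Build a producer property map (hypothetical helper, not a Kafka API).

    brokers: list of "host:port" strings; at least two are expected so the
    client can fail over while one broker is being patched.
    """
    if len(brokers) < 2:
        raise ValueError("list at least two brokers for failover")
    return {
        # Multiple brokers in the connection string allow failover.
        "bootstrap.servers": ",".join(brokers),
        # Illustrative value; choose enough retries for your workload.
        "retries": retries,
        # 50-100 ms gives leadership time to move before the next attempt.
        "retry.backoff.ms": retry_backoff_ms,
    }


# Placeholder hostnames; substitute your cluster's bootstrap brokers.
config = build_producer_config(
    ["b-1.example.com:9092", "b-2.example.com:9092", "b-3.example.com:9092"]
)
```

The resulting dictionary can be passed to whichever Kafka client library you use, provided it accepts these property names; some clients spell them differently, so check your client's documentation.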
Hi, can't we reschedule or stop the automatic patch updates in MSK so that they can be done manually?