Do Kafka security patches cause data loss?


Hi there, we recently got a notification about a security patch update for our Amazon MSK (Kafka) cluster. Following AWS best practices, we have brokers spread across 3 AZs to avoid data loss. After the patching completed, we realised that Kafka had been shut down during the patching. See the CloudWatch log below.

msk.png

Here is the cluster configuration. Please let me know if I'm missing anything.

auto.create.topics.enable=true
default.replication.factor=2
min.insync.replicas=2
num.io.threads=8
num.network.threads=5
num.partitions=1
num.replica.fetchers=2
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000

Thanks

asked 2 years ago, 2471 views
1 Answer

Hi there, security patching triggers a rolling reboot. During this time, partition leadership moves from one broker to another as brokers are restarted. Clients can get connection errors (Connection refused or timeout) or errors saying the leader is not the same as before, but they request metadata again to find the correct leader and automatically retry operations against other available brokers. This can show up as client-side latency, but it does not affect the functionality of the client and won't cause any data loss as long as the best practices below are followed:

  1. Ensure the topic replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during patching.
  2. Set minimum in-sync replicas (minISR) to at most RF - 1 so that the partition replica set can tolerate one replica being offline or under-replicated.
  3. Ensure clients are configured with multiple brokers in their connection strings. Having multiple brokers in a client’s connection string allows for failover if a specific broker supporting client I/O begins to be patched.
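
The first two rules above can be expressed as a small check. This is only a sketch: the function name `check_topic_settings` and its messages are made up for illustration, and it covers only the two-AZ and three-AZ cases mentioned in the list.

```python
def check_topic_settings(replication_factor, min_insync_replicas, az_count):
    """Return a list of problems with the RF / minISR settings (empty if none).

    Hypothetical helper: it only encodes rules 1 and 2 from the list above.
    """
    problems = []

    # Rule 1: RF >= 2 for two-AZ clusters, RF >= 3 for three-AZ clusters.
    required_rf = 3 if az_count >= 3 else 2
    if replication_factor < required_rf:
        problems.append(
            "RF=%d is below the recommended minimum of %d for %d AZs"
            % (replication_factor, required_rf, az_count)
        )

    # Rule 2: minISR <= RF - 1, so one replica can be offline during patching.
    if min_insync_replicas > replication_factor - 1:
        problems.append(
            "minISR=%d exceeds RF - 1 = %d"
            % (min_insync_replicas, replication_factor - 1)
        )

    return problems


# Values from the question's config: default.replication.factor=2,
# min.insync.replicas=2, brokers in 3 AZs.
for problem in check_topic_settings(2, 2, 3):
    print(problem)
```

With the values from the question (RF=2, minISR=2, 3 AZs) both rules are reported as violated; with RF=3 and minISR=2 the list is empty.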

Since you have 3 AZs, please use RF=3 and minISR=2 (your config currently sets default.replication.factor=2; min.insync.replicas=2 is already the right number). On the producer side, please make sure you have enough retries set. Since it can take a few milliseconds for leadership to transfer to another broker, you can set retry.backoff.ms to 50-100 ms so that the producer waits briefly before retrying.
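
The producer-side advice above could look roughly like the following. This is a minimal sketch, not an official recommendation: the helper name `build_producer_config`, the broker addresses, and the retry count of 10 are placeholders, and `acks=all` is an addition not stated in the answer (min.insync.replicas is only enforced for producers that use it). The keys are standard Kafka producer setting names, so the resulting dictionary should be usable with a client that accepts them.

```python
def build_producer_config(bootstrap_servers):
    """Build producer settings that tolerate a rolling broker restart.

    Hypothetical helper; the values are illustrative, not tuned.
    """
    # List more than one broker so the client can fail over while
    # one of them is being patched.
    if len(bootstrap_servers) < 2:
        raise ValueError("list at least two brokers in the connection string")

    return {
        "bootstrap.servers": ",".join(bootstrap_servers),
        # Assumption: wait for all in-sync replicas, so that
        # min.insync.replicas actually applies to each write.
        "acks": "all",
        # "Enough retries": 10 is a placeholder value.
        "retries": 10,
        # Within the 50-100 ms range suggested in the answer.
        "retry.backoff.ms": 100,
    }


# Placeholder broker addresses; replace with your cluster's bootstrap brokers.
config = build_producer_config([
    "b-1.example.com:9092",
    "b-2.example.com:9092",
    "b-3.example.com:9092",
])
print(config["bootstrap.servers"])
```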

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Hi, can't we reschedule or stop automatic patch updates so that they can be done manually in MSK?
