Do Kafka security patches cause data loss?
Hi there, we recently got a notification about the Amazon MSK security patch update. We have brokers spread across 3 AZs to avoid data loss, per AWS best practices. After the patching completed, we realised that Kafka was shut down during the patching; see the CloudWatch log below.
Here is the cluster configuration. Please let me know if I'm missing anything.
auto.create.topics.enable=true
default.replication.factor=2
min.insync.replicas=2
num.io.threads=8
num.network.threads=5
num.partitions=1
num.replica.fetchers=2
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000
Thanks
Hi there, security patching triggers a rolling reboot of the brokers. During this time, partition leadership moves from one broker to another as each broker is restarted. Clients can see connection errors (for example, connection refused, timeouts, or errors indicating the leader has changed), but they re-request metadata to find the new leader and automatically retry operations against the other available brokers. This can show up as client-side latency, but it does not impact client functionality and will not cause data loss as long as the best practices below are followed:
- Ensure the topic replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during patching.
- Set minimum in-sync replicas (minISR) to at most RF - 1 so the partition replica set can tolerate one replica being offline or under-replicated (see the sketch after this list).
- Ensure clients are configured to use multiple broker connection strings. Having multiple brokers in a client's connection string allows for failover if a specific broker serving client I/O begins to be patched.
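To illustrate the first two points, here is a minimal sketch using the Java AdminClient that creates a topic with RF=3 and min.insync.replicas=2 for a three-AZ cluster. The topic name and broker endpoints are placeholders, not values from your cluster.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateResilientTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap brokers; replace with your MSK broker connection string.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.example:9092,b-2.example:9092,b-3.example:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // RF=3 places one replica per AZ; min.insync.replicas=2 lets acks=all
            // writes continue while one broker is down for patching.
            NewTopic topic = new NewTopic("example-topic", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}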
Since you have a 3-AZ cluster, please use RF=3 and minISR=2 (minISR is already set to the right value in your config). On the producer side, make sure you have enough retries configured, and since it can take a few milliseconds for leadership to transfer to another broker, you can set retry.backoff.ms to 50-100 ms so the producer waits briefly before retrying.
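For reference, here is a minimal producer configuration sketch along those lines (Java client). The topic name and broker endpoints are placeholders, and the retry and backoff values are assumptions you should tune for your workload.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PatchTolerantProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap brokers; list one broker per AZ so the client can fail over.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.example:9092,b-2.example:9092,b-3.example:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // With minISR=2, acks=all means a write lands on at least 2 replicas before being acknowledged.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry through the leadership transfer, waiting ~100 ms between attempts.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
        // Idempotence keeps retries from producing duplicates or reordering.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        }
    }
}

With settings like these, a broker restart during patching typically shows up only as a brief latency spike while the producer refreshes metadata and retries.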