By using AWS re:Post, you agree to the Terms of Use
/Data loss while MSK is in an HEALING state./

Data loss while MSK is in an HEALING state.


Hi there, We had about 10 mins downtime today while MSK was in a HEALING state. But according to the doc available online. A healing state should not affect the cluster to produce or consume. Is there a reason why we had downtime? Below is our cluster configuration


I found these logs on the consumer side.

"error":"Messages are rejected since there are fewer in-sync replicas than required","correlationId":5,"size":57}

Kindly advise if anything is wrong with the configuration. Thanks

1 Answers

Hi there, how many brokers you have? If the following best practices are followed (1) it wouldn't cause any downtime when one broker is down. When one broker is down it is possible a partition will have one less replica then it is configured to have. So that is the reason replication factor should be greater than min.insync.replicas by 1 i.e RF = minISR+1. Based on the above error, it seems you have acks=ALL which requires atleast 2 replicas to be insync (based on min.insync.replicas=2 in your config) but seems less than 2 replicas were in sync at the time so the error.

Even though default.replication.factor=3, it only applies to auto created topics. If you create topics manually you will have to specify replication factor while topic creation. Please make sure RF for topics is 3 if it is 3AZ cluster. If it is 2AZ cluster please decrease the value of min.insync.replicas to 1.


answered a month ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions