The way you are trying to simulate an AZ failure is incorrect. Blocking a subnet in your VPC doesn't stop the brokers from communicating with each other. Remember that all brokers run in a managed VPC, not in your VPC, and clients communicate with the brokers over ENIs. So when you block a subnet, all you do is block clients from reaching the ENI in that subnet. Meanwhile, the brokers can still report their heartbeats, and they have no problem replicating data or acknowledging produce requests sent to other brokers. Your clients, however, will be unable to produce to (or consume from) the partitions whose leaders sit behind the ENI in the blocked subnet.
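To make the effect concrete, here is a minimal sketch of which partitions a client loses when one subnet is blocked. The broker IDs, subnet names, and leader assignment are made up for illustration; they are not taken from the cluster in the question.

```python
def unreachable_partitions(leaders, broker_subnet, blocked_subnet):
    """Return the partitions whose leader sits behind an ENI in the blocked subnet."""
    return sorted(
        partition
        for partition, broker in leaders.items()
        if broker_subnet[broker] == blocked_subnet
    )


# Hypothetical layout: 3 brokers, one ENI per subnet, 6 partitions.
broker_subnet = {1: "subnet-a", 2: "subnet-b", 3: "subnet-c"}
leaders = {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3}  # partition -> leader broker

print(unreachable_partitions(leaders, broker_subnet, "subnet-a"))  # [0, 3]
```

Leadership itself does not change in this scenario, because from the cluster's point of view broker 1 is still healthy.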
If you want to test a temporary broker failure, you can call the MSK API to reboot a broker. While the broker restarts, new leaders are elected for some partitions, and leadership is re-elected again when the broker comes back.
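A rough sketch of that call using boto3, assuming the `kafka` client's `reboot_broker` operation. The helper names, region, cluster ARN, and broker ID are placeholders, so adjust them to your cluster.

```python
def reboot_broker(client, cluster_arn, broker_id):
    """Ask MSK to reboot one broker and return the cluster operation ARN."""
    response = client.reboot_broker(
        ClusterArn=cluster_arn,
        BrokerIds=[str(broker_id)],  # the API expects broker IDs as strings
    )
    return response["ClusterOperationArn"]


def make_client(region):
    """Create the MSK client; boto3 is imported lazily so the helper above
    can be exercised with a stub client."""
    import boto3
    return boto3.client("kafka", region_name=region)


# Example (placeholder values, requires AWS credentials):
# client = make_client("us-east-1")
# print(reboot_broker(client, "arn:aws:kafka:...", 1))
```

While the reboot is in progress, describing the topic should show leadership moving away from the rebooted broker.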
The behavior you're observing with your Amazon MSK cluster during an AZ-level failure simulation is related to how Apache Kafka handles leader elections and broker failures.
Your configuration seems to be set up for high availability with 3 AZs and 1 broker per AZ. However, there are a few factors that could be affecting the leader election process:
- Leader Election: Kafka doesn't automatically change the leader for a partition unless the current leader becomes completely unavailable or there's a controlled reassignment. In your case, denying access to the subnet might not trigger a leader change if the affected broker is still running and able to communicate with ZooKeeper.
- Unclean Leader Election: You have set 'unclean.leader.election.enable=true', which allows out-of-sync replicas to become leaders. This setting can help maintain availability but may lead to data loss and might affect the leader election process.
- Replica Lag: The 'replica.lag.time.max.ms' is set to 30000 (30 seconds). This means a replica has to lag behind the leader for at least 30 seconds before it is removed from the in-sync replica set (ISR). This could delay leader changes.
- Network Threads: With only 5 network threads ('num.network.threads=5'), your cluster might not detect network issues as quickly as it could with more threads.
To improve your cluster's responsiveness to AZ failures:
- Consider increasing 'num.network.threads' to help detect network issues faster.
- You might want to decrease 'replica.lag.time.max.ms' to detect offline brokers more quickly.
- Ensure that your client configurations are set to recognize all brokers and attempt to reconnect if the leader becomes unavailable.
- Monitor the 'UnderReplicatedPartitions' metric to see if replicas are falling out of sync during your tests.
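As a quick cross-check alongside the metric, a small sketch like the following can flag partitions whose ISR is smaller than their replica list in the output of a topic describe command. It assumes the usual `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...` line layout, which can vary between Kafka versions; the sample text is invented.

```python
import re

# Matches one per-partition line of the describe output (assumed layout).
PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)"
    r"\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, leader, isr) for partitions with a shrunken ISR."""
    rows = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # skip the topic summary line and anything unexpected
        topic, partition, leader, replicas, isr = match.groups()
        replica_ids = replicas.split(",")
        isr_ids = [b for b in isr.split(",") if b]
        if len(isr_ids) < len(replica_ids):
            rows.append((topic, int(partition), leader, isr_ids))
    return rows


sample = (
    "Topic: orders\tPartitionCount: 2\tReplicationFactor: 3\n"
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,3\n"
)
print(under_replicated(sample))  # [('orders', 1, '2', ['2', '3'])]
```

If the ISR stays full while a subnet is blocked, that suggests the brokers can still reach each other and the cluster does not see the broker as failed.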
Remember that Kafka is designed to maintain stability, so it won't change leaders unnecessarily. If you need more aggressive failover, you might need to manually reassign partitions or adjust your test methodology to ensure the broker is truly seen as offline by the cluster.
I tried consuming again after a few minutes, but describing the topic still shows the old, blocked broker as the leader. I have also denied all access to the broker I want to fail, and I am passing all the brokers to my CLI command.

Thanks Edbe.