
MSK AZ Level Fault Tolerance

0

Hi, I am trying to test how my MSK cluster behaves during an AZ-level failure. I have configured MSK across 3 AZs with 1 broker in each AZ. To simulate the failure, I add a deny rule to the MSK subnet in a single AZ. When I did that, I noticed that the leader of a given topic is not updated, even though that AZ is not accessible to the other two brokers or the client. My configuration is below. I have tested with the Kafka CLI tools as well as the confluent-kafka Python client.

auto.create.topics.enable=true
default.replication.factor=3
min.insync.replicas=2
num.io.threads=8
num.network.threads=5
num.partitions=1
num.replica.fetchers=2
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000
connections.max.idle.ms=600000

MSK Version used - 3.6.0

Thank You

2 Answers
1
Accepted Answer

The way you are simulating an AZ failure is incorrect. Blocking a subnet in your VPC doesn't stop the brokers from communicating with each other. Don't forget that all brokers run in a service-managed VPC, not in your VPC, and clients reach them through ENIs in your subnets. So when you block a subnet, all you do is block clients from accessing the ENI in that subnet. The brokers can still report their heartbeats, and they have no issues replicating or acknowledging send requests that go to other brokers. Your clients, though, will be unable to produce to (or consume from) partitions whose leaders sit behind the ENI in the blocked subnet.
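To see which partitions sit behind a blocked ENI, it can help to group partitions by their current leader. The following is a minimal sketch, assuming the confluent-kafka Python client mentioned in the question; the helper names `partitions_by_leader` and `fetch_topics` are illustrative and not part of any library.

```python
from collections import defaultdict


def partitions_by_leader(topics):
    """Group (topic, partition) pairs by the broker id that leads them.

    `topics` maps a topic name to an object with a `.partitions` dict whose
    values expose `.id` and `.leader`, which is the shape of
    confluent_kafka's ClusterMetadata.topics.
    """
    grouped = defaultdict(list)
    for name, topic in topics.items():
        for partition in topic.partitions.values():
            grouped[partition.leader].append((name, partition.id))
    return {leader: sorted(parts) for leader, parts in grouped.items()}


def fetch_topics(bootstrap_servers, timeout=10.0):
    """Fetch topic metadata from the cluster (hypothetical helper)."""
    # Imported lazily so the grouping helper works without the client library.
    from confluent_kafka.admin import AdminClient

    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    return admin.list_topics(timeout=timeout).topics
```

Calling `partitions_by_leader(fetch_topics(...))` with your bootstrap broker string returns a dict keyed by broker id. Partitions listed under the broker in the blocked subnet are the ones your clients would be unable to reach.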

If you want to test a temporary broker failure, you can call the MSK RebootBroker API. While the broker is restarting, new leaders will be elected for some partitions, and leaders will be re-elected again when the broker comes back.
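A broker restart of this kind can be requested from Python. This is a rough sketch, assuming boto3 is installed and your credentials are allowed to reboot brokers; the cluster ARN and broker id are placeholders you would supply, and the function names are illustrative.

```python
def build_reboot_request(cluster_arn, broker_id):
    """Build the keyword arguments for the MSK RebootBroker call."""
    # The API takes broker ids as a list of strings.
    return {"ClusterArn": cluster_arn, "BrokerIds": [str(broker_id)]}


def reboot_broker(cluster_arn, broker_id, region_name=None):
    """Ask MSK to reboot one broker and return the operation ARN."""
    # Imported lazily so the request builder can be used without boto3.
    import boto3

    client = boto3.client("kafka", region_name=region_name)
    response = client.reboot_broker(**build_reboot_request(cluster_arn, broker_id))
    return response["ClusterOperationArn"]
```

Describing the topic before, during, and after the reboot should show the leader moving away from the restarted broker and possibly moving back afterwards.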

AWS
answered a year ago
  • Thanks Edbe.

0

The behavior you're observing with your Amazon MSK cluster during an AZ-level failure simulation is related to how Apache Kafka handles leader elections and broker failures.

Your configuration seems to be set up for high availability with 3 AZs and 1 broker per AZ. However, there are a few factors that could be affecting the leader election process:

  1. Leader Election: Kafka doesn't move leadership for a partition unless the controller sees the current leader as unavailable or there is a controlled reassignment. In your case, denying access to the subnet might not trigger a leader change if the affected broker is still running and able to communicate with ZooKeeper.

  2. Unclean Leader Election: You have set 'unclean.leader.election.enable=true', which allows out-of-sync replicas to become leaders. This only comes into play when no in-sync replica is available. It can help maintain availability but may lead to data loss.

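  3. Replica Lag: 'replica.lag.time.max.ms' is set to 30000 (30 seconds). A follower that has not caught up with the leader for that long is removed from the in-sync replica set (ISR). This setting does not mark a broker as offline; in a ZooKeeper-based cluster, that depends on 'zookeeper.session.timeout.ms', which is 18000 (18 seconds) in your configuration.

  4. Network Threads: 'num.network.threads=5' sets how many threads each broker uses to handle network requests. It affects request throughput, not how quickly a failed broker is detected.

To improve your cluster's responsiveness to AZ failures:

  1. Leave 'num.network.threads' as it is unless you see request-handling bottlenecks; changing it will not speed up failure detection.
  2. You might want to decrease 'replica.lag.time.max.ms' so that lagging followers leave the ISR sooner, and review 'zookeeper.session.timeout.ms', which controls how quickly an unreachable broker is treated as failed.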
  3. Ensure that your client configurations are set to recognize all brokers and attempt to reconnect if the leader becomes unavailable.
  4. Monitor the 'UnderReplicatedPartitions' metric to see if replicas are falling out of sync during your tests.

Remember that Kafka is designed to maintain stability, so it won't change leaders unnecessarily. If you need more aggressive failover, you might need to manually reassign partitions or adjust your test methodology to ensure the broker is truly seen as offline by the cluster.
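One way to check from the client side whether replicas are falling out of sync during a test is to compare each partition's ISR with its replica list. The sketch below assumes topic metadata in the shape returned by confluent-kafka's `AdminClient.list_topics(...).topics`; the function name `under_replicated` is illustrative, and the default of 2 mirrors the 'min.insync.replicas' value in the question.

```python
def under_replicated(topics, min_insync_replicas=2):
    """List partitions whose ISR is smaller than their replica set.

    `topics` maps a topic name to an object with a `.partitions` dict whose
    values expose `.id`, `.leader`, `.replicas` and `.isrs`.
    """
    report = []
    for name in sorted(topics):
        partitions = topics[name].partitions.values()
        for partition in sorted(partitions, key=lambda p: p.id):
            isr_count = len(partition.isrs)
            replica_count = len(partition.replicas)
            if isr_count < replica_count:
                report.append({
                    "topic": name,
                    "partition": partition.id,
                    "leader": partition.leader,
                    "isr": isr_count,
                    "replicas": replica_count,
                    # Producers using acks=all fail below this threshold.
                    "below_min_isr": isr_count < min_insync_replicas,
                })
    return report
```

An empty result while a subnet is blocked suggests the brokers are still replicating normally, so the cluster has no reason to elect a new leader.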

answered a year ago
EXPERT
reviewed a year ago
  • I tried consuming again after a few minutes, and describing the topic still shows the old, blocked broker as the leader. I have also denied all access to the broker I want to fail, and I am passing all the brokers to my CLI command.
