How is it possible to remove a specific broker from an MSK cluster (due to 100% CPU)?


Recently we went through one of the worst incidents I have been a part of. Much of our infrastructure is supported by Kafka for the various event messages that the different applications create, among other uses. One of the brokers managed to get itself corrupted, or more accurately into a very bad state, where its CPU sat at 100% (normal is ~15%), making life very difficult for producers. This node was also the leader on several partitions in each topic, although there was sufficient RF (2 for each topic, across 4 brokers) that a node going down would not mean data loss.

Producers would either take a very, very long time to produce a message (> 1000 ms) or, worse still, the operation would fail outright. Although the applications were configured to time out after a certain period, or to swallow the errors so that a failed Kafka operation would not interrupt service, many of our processes depend on that data. Frankly, this situation wasn't something we were well prepared for: all of the discussions had been around "what if a node fails", not "what if a node is operating so poorly that it affects every application connected to this cluster". Restarting the node was an option, and... it didn't really help. After the restart things seemed improved, back down to normal, but it didn't take long before the broker hit 100% CPU again and caused issues with producers throughout the environment.
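
For context on the client side, the protection we relied on amounts to bounding produce latency and absorbing failures in the delivery callback. A minimal sketch, assuming confluent-kafka; the broker address and timeout values are illustrative, not our actual configuration:

```python
# Illustrative producer settings for bounding how long a send can block
# when a broker is degraded. Address and values are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "b-1.example.kafka.us-east-1.amazonaws.com:9092",  # placeholder
    # Fail a message after 5s instead of retrying indefinitely against a sick leader.
    "delivery.timeout.ms": 5000,
    # Cap how long a single produce request may wait on broker acks.
    "request.timeout.ms": 1000,
    "acks": "all",
})

def on_delivery(err, msg):
    # Swallow (but log) errors so a failed produce never blocks the service.
    if err is not None:
        print(f"produce failed, continuing: {err}")

producer.produce("events", value=b"payload", on_delivery=on_delivery)
producer.flush(10)
```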

On the MSK screen there is an option to remove nodes, but it seems to be specifically about downsizing the cluster: we do not get to choose which nodes are removed, and we must remove the same number of nodes in each AZ, so this option seems unfit for the purpose. Restarting is an option, but not the one we want. If this had been my bare-metal Kafka cluster, we would simply have removed the offending broker, probably added a replacement, and been done with it.

What we ended up doing was altering the machine size of the brokers, forcing each broker to be rebuilt, which ultimately solved the issue with the offending broker. I am hoping there is a better way to handle this. It's not the end of the world if that is our only option, or if we would be better off with a self-managed cluster. But now that we know it is entirely possible to suffer from this poison-pill effect, I am keen to know what the experts would recommend should such a thing, or similar, happen again. One hopes the likelihood of a repeat is low (I have not seen this behavior in 15 years of working with Kafka), but we have to be able to recover gracefully and in a timely fashion, which makes the broker-resize methodology slightly less attractive (it takes about an hour, and it caused some challenges with consumer offsets getting corrupted and issues downstream). Suggestions / advice welcome.
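
For reference, the resize we fell back on can be driven through the UpdateBrokerType API rather than the console; a minimal boto3 sketch, with the cluster ARN and target instance type as placeholders:

```python
# Sketch of the "resize to force a rebuild" workaround via the MSK
# UpdateBrokerType API. The ARN and target type below are placeholders.
import boto3

msk = boto3.client("kafka", region_name="us-east-1")

cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/example/abc"  # placeholder
cluster = msk.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]

resp = msk.update_broker_type(
    ClusterArn=cluster_arn,
    CurrentVersion=cluster["CurrentVersion"],  # optimistic-locking version, not the Kafka version
    TargetInstanceType="kafka.m5.2xlarge",     # placeholder: one size up from current
)
print(resp["ClusterOperationArn"])  # poll this operation; the rolling rebuild took us ~1 hour
```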

Furthermore, the documentation suggests that after some time our cluster would go into a healing state, or that the managed-service aspect would attempt to repair our unresponsive broker. This did not happen; it stayed in this high-CPU state for many hours (> 12). It took some time to even see what was happening, through too many failed produce attempts on the upstream clients; then we had the whole debacle of restarting, adding more brokers to the cluster, and trying to reassign partitions away from the poisoned broker. Incidentally, it was impossible to reassign all of the partitions, because the broker in question was unresponsive; the operation would simply hang and wait on those partitions where the damaged broker was the assigned leader.

Edit: we are on v2.8.1, for anyone interested. The AI answer seems to suggest that we can remove the specific broker via the console or via UpdateBrokerCount, but, and possibly this is my lack of basic comprehension skills, I don't see it asking for the broker ID or ARN of the offending broker, so I am a little puzzled as to how that would happen. The worst possible outcome would be removing healthy broker(s), leaving our poisoned broker responsible for a greater share of requests / partitions.

1 Answer

Amazon MSK now supports the removal of specific brokers from provisioned clusters, which can be helpful in situations like the one you described. This feature allows you to remove problematic brokers without having to resize the entire cluster or migrate to a new one.

To remove a specific broker:

  1. First, move all partitions off the problematic broker. This can be done using the Kafka admin APIs or tools such as Cruise Control (see the reassignment sketch after this list).

  2. Once the partitions are moved, you can remove the broker using the AWS Console, CLI, or API. The UpdateBrokerCount API is used for this purpose (see the boto3 sketch after this list).

  3. You can remove multiple brokers in a single operation, but there's a limit of removing one broker per Availability Zone at a time.

  4. The target number of broker nodes must be a multiple of the number of availability zones in your cluster.
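
For step 1, here is a minimal sketch of what draining a broker can look like with the stock kafka-reassign-partitions.sh tool. The topic layout, broker IDs, and bootstrap address are all illustrative; in practice you would generate the current assignments from kafka-topics.sh --describe or the Admin API rather than hard-coding them:

```python
# Sketch: build a reassignment plan that excludes a sick broker (ID 3 here,
# illustrative) from the replica lists, then hand it to the stock
# kafka-reassign-partitions.sh tool (assumed to be on PATH).
import json
import subprocess

BAD_BROKER = 3
HEALTHY = [1, 2, 4]  # remaining broker IDs (illustrative)

# Current assignments to change; generate these from kafka-topics.sh
# --describe or the Admin API in a real run.
current = [
    {"topic": "events", "partition": 0, "replicas": [3, 1]},
    {"topic": "events", "partition": 1, "replicas": [2, 3]},
]

plan = {"version": 1, "partitions": []}
for p in current:
    replicas = [b for b in p["replicas"] if b != BAD_BROKER]
    # Backfill from healthy brokers to preserve the replication factor.
    for b in HEALTHY:
        if len(replicas) == len(p["replicas"]):
            break
        if b not in replicas:
            replicas.append(b)
    plan["partitions"].append(
        {"topic": p["topic"], "partition": p["partition"], "replicas": replicas}
    )

with open("reassign.json", "w") as f:
    json.dump(plan, f)

subprocess.run(
    [
        "kafka-reassign-partitions.sh",
        "--bootstrap-server", "b-1.example.kafka.us-east-1.amazonaws.com:9092",  # placeholder
        "--reassignment-json-file", "reassign.json",
        "--execute",
    ],
    check=True,
)
```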
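For steps 2 through 4, a boto3 sketch of the UpdateBrokerCount call (the cluster ARN is a placeholder). Note that the API takes only a target broker count, not a broker ID; per the linked AWS material, MSK removes brokers that no longer host any partitions, so the drain in step 1 is what effectively selects which broker goes.

```python
# Sketch: shrink the cluster via UpdateBrokerCount. The ARN is a
# placeholder. The drained (partition-free) broker is the one MSK
# can safely remove.
import boto3

msk = boto3.client("kafka", region_name="us-east-1")
cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/example/abc"  # placeholder

info = msk.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
current = info["NumberOfBrokerNodes"]

resp = msk.update_broker_count(
    ClusterArn=cluster_arn,
    CurrentVersion=info["CurrentVersion"],
    # The target must stay a multiple of the number of AZs (see step 4);
    # e.g. minus 3 means one broker per AZ in a 3-AZ cluster (illustrative).
    TargetNumberOfBrokerNodes=current - 3,
)
print(resp["ClusterOperationArn"])
```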

It's important to note that this feature is supported on Kafka versions 2.8.1 and above, and is available for M5 and M7g based MSK provisioned clusters. It's not supported on t3.small instance types.

Before removing a broker, carefully consider your cluster's minimum broker count based on availability, durability, local data retention, and partition count. Monitor CloudWatch metrics, partition count, and unused topics to ensure you have enough resources on the remaining brokers.

While this feature provides a more targeted approach to dealing with problematic brokers, it's still a good idea to test the process in a non-production environment before implementing it in production.

Regarding the healing state or managed service repairs you mentioned, it's concerning that this didn't occur as expected. In such situations, it's advisable to contact AWS support for assistance, as they can provide more insight into why automatic healing didn't take place and potentially help resolve the issue more quickly.

For future incidents, having a well-defined runbook that includes steps for identifying and removing problematic brokers could help in faster resolution. Additionally, setting up more granular monitoring and alerting for individual broker performance could aid in earlier detection of issues.
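
On the monitoring point, here is a sketch of a per-broker CPU alarm on the CpuUser metric that MSK publishes to CloudWatch; the cluster name, broker ID, threshold, and SNS topic are illustrative placeholders:

```python
# Sketch: alarm on a single broker's CPU using the per-broker CpuUser
# metric that MSK publishes under the AWS/Kafka namespace.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="msk-broker-1-cpu-high",
    Namespace="AWS/Kafka",
    MetricName="CpuUser",
    Dimensions=[
        {"Name": "Cluster Name", "Value": "example-cluster"},  # placeholder
        {"Name": "Broker ID", "Value": "1"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,          # i.e. sustained for 15 minutes
    Threshold=60.0,               # placeholder: well below a pegged 100%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder
)
```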
Sources
Safely remove Kafka brokers from Amazon MSK provisioned clusters | AWS Big Data Blog
Remove a broker from an Amazon MSK cluster - Amazon Managed Streaming for Apache Kafka
Amazon MSK now supports the removal of brokers from MSK provisioned clusters

answered a month ago
  • If we are dependent on having all of the partitions reassigned, then we are in trouble here, because reassigning the partitions for which the specific broker was the designated leader was not possible. It may be possible to complete the reassignment by forcing a restart of the broker in question. Given that Kafka should ultimately reassign the partitions itself if they are assigned to a broker that is no longer part of the cluster or responsive, is it 100% necessary to have a successful partition reassignment before attempting the above? It rather puts us in a no-win scenario: we have to lose the broker in order to restore service, we have to reassign the partitions in order to lose the broker, and we have to lose the broker in order to reassign the partitions...
