Amazon MSK now supports the removal of specific brokers from provisioned clusters, which can be helpful in situations like the one you described. This feature allows you to remove problematic brokers without having to resize the entire cluster or migrate to a new one.
To remove a specific broker:
- First, you need to move all partitions off the problematic broker. This can be done using Kafka admin APIs or tools like Cruise Control.
- Once partitions are moved, you can remove the broker using the AWS Console, CLI, or API. The UpdateBrokerCount API can be used for this purpose.
- You can remove multiple brokers in a single operation, but there's a limit of removing one broker per Availability Zone at a time.
- The target number of broker nodes must be a multiple of the number of Availability Zones in your cluster.
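As a rough illustration of the first step and of the count rules above, the sketch below builds a reassignment plan in the JSON layout that Kafka's `kafka-reassign-partitions` tool reads, and checks a target broker count before any UpdateBrokerCount call. The function names, the input dictionary shape, and the sample broker IDs are hypothetical; it assumes you have already fetched the current replica assignments yourself, and it does not talk to Kafka or AWS.

```python
import json
from itertools import cycle


def build_reassignment(assignments, broker_to_remove, remaining_brokers):
    """Build a plan that moves every replica off broker_to_remove.

    assignments: {(topic, partition): [broker_id, ...]} (hypothetical input shape)
    """
    spare = cycle(sorted(remaining_brokers))
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if broker_to_remove not in replicas:
            continue  # this partition does not touch the broker
        new_replicas = []
        for broker in replicas:
            if broker != broker_to_remove:
                new_replicas.append(broker)
                continue
            # Take the next remaining broker that holds no replica of this partition.
            for _ in range(len(remaining_brokers)):
                candidate = next(spare)
                if candidate not in replicas and candidate not in new_replicas:
                    new_replicas.append(candidate)
                    break
            else:
                raise ValueError(f"no spare broker for {topic}-{partition}")
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


def validate_target_broker_count(current, target, num_azs):
    """Check the count rules described in the list above."""
    if target >= current:
        raise ValueError("target must be lower than the current broker count")
    if target % num_azs != 0:
        raise ValueError("target must be a multiple of the number of AZs")
    if current - target > num_azs:
        raise ValueError("at most one broker per AZ can be removed per operation")
    return True


# Sample data: broker 3 is the one to remove from a six-broker, three-AZ cluster.
current_assignments = {
    ("orders", 0): [3, 1, 2],
    ("orders", 1): [1, 2, 4],
    ("orders", 2): [5, 3, 6],
}
plan = build_reassignment(current_assignments, 3, [1, 2, 4, 5, 6])
plan_json = json.dumps(plan, indent=2)
```

The resulting `plan_json` could be saved to a file and passed to `kafka-reassign-partitions.sh` with `--reassignment-json-file` and `--execute`. The replacement broker is chosen round-robin here for simplicity; a tool like Cruise Control would balance by load instead.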
It's important to note that this feature is supported on Kafka versions 2.8.1 and above, and is available for M5 and M7g based MSK provisioned clusters. It's not supported on t3.small instance types.
Before removing a broker, carefully consider your cluster's minimum broker count based on availability, durability, local data retention, and partition count. Monitor CloudWatch metrics, partition count, and unused topics to ensure you have enough resources on the remaining brokers.
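A back-of-the-envelope check of the partition-count consideration might look like the following sketch. It assumes replicas are spread evenly across the remaining brokers, and `max_per_broker` is a placeholder you would fill in from MSK's sizing guidance for your instance type; the function names are hypothetical.

```python
def partitions_per_broker_after_removal(total_replicas, current_brokers, brokers_removed):
    """Estimate partition replicas per broker once brokers are removed.

    Assumes an even spread across the remaining brokers (an idealization).
    """
    remaining = current_brokers - brokers_removed
    if remaining <= 0:
        raise ValueError("at least one broker must remain")
    # Ceiling division: the busiest broker holds at least this many replicas.
    return -(-total_replicas // remaining)


def has_headroom(total_replicas, current_brokers, brokers_removed, max_per_broker):
    """True if the estimated per-broker count stays within max_per_broker."""
    estimate = partitions_per_broker_after_removal(
        total_replicas, current_brokers, brokers_removed
    )
    return estimate <= max_per_broker
```

For example, 6000 partition replicas on six brokers come to 1000 per broker; removing three brokers doubles that to 2000, which only fits if the per-broker limit for the instance type is at least 2000.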
While this feature provides a more targeted approach to dealing with problematic brokers, it's still a good idea to test the process in a non-production environment before implementing it in production.
Regarding the healing state or managed service repairs you mentioned, it's concerning that this didn't occur as expected. In such situations, it's advisable to contact AWS support for assistance, as they can provide more insight into why automatic healing didn't take place and potentially help resolve the issue more quickly.
For future incidents, having a well-defined runbook that includes steps for identifying and removing problematic brokers could help in faster resolution. Additionally, setting up more granular monitoring and alerting for individual broker performance could aid in earlier detection of issues.
Sources
Safely remove Kafka brokers from Amazon MSK provisioned clusters | AWS Big Data Blog
Remove a broker from an Amazon MSK cluster - Amazon Managed Streaming for Apache Kafka
Amazon MSK now supports the removal of brokers from MSK provisioned clusters
If we depend on having all of the partitions reassigned first, then we are in trouble here, because reassigning the partitions for which that specific broker was the designated owner was not possible. It may be possible to complete the reassignment operation by forcing a restart of the broker in question. Given that the Kafka engine should ultimately reassign the partitions itself if they are assigned to a broker that is unresponsive or no longer part of the cluster, is it 100% necessary to have a successful partition reassignment before attempting the above? It puts us in a no-win scenario: we have to lose the broker in order to restore service, we have to reassign the partitions in order to lose the broker, and we have to lose the broker in order to reassign the partitions...