I want to troubleshoot my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster that's in HEALING state.
Resolution
Your Amazon MSK cluster goes into the HEALING state when the service runs an internal operation to address an issue. For example, when brokers are unresponsive, Amazon MSK runs an internal operation to fix them.
You can continue to use the cluster to produce and consume data even while the cluster is in the HEALING state. However, you can't perform Amazon MSK API or AWS Command Line Interface (AWS CLI) update operations on the cluster until it returns to the ACTIVE state.
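Because update operations aren't accepted until the cluster returns to the ACTIVE state, a script can check the state before it attempts an update. The following is a minimal sketch, assuming a boto3 Kafka client and its `describe_cluster` call. The helper names `get_cluster_state` and `can_update_cluster` are illustrative and aren't part of any AWS API.

```python
def get_cluster_state(kafka_client, cluster_arn):
    """Return the cluster state string, for example 'ACTIVE' or 'HEALING'."""
    response = kafka_client.describe_cluster(ClusterArn=cluster_arn)
    return response["ClusterInfo"]["State"]


def can_update_cluster(kafka_client, cluster_arn):
    """Return True only when the cluster is ACTIVE and can accept updates."""
    return get_cluster_state(kafka_client, cluster_arn) == "ACTIVE"
```

To use the helpers, pass in a client created with `boto3.client("kafka")` and the ARN of your cluster. Producing and consuming data doesn't depend on this check.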
You can use the Amazon CloudWatch metrics for Amazon MSK to check why the cluster is in HEALING state.
Complete the following steps:
- Open the Amazon CloudWatch console.
- In the navigation pane, choose Metrics, and then choose All metrics.
- In the Browse tab, choose AWS/Kafka.
- Under Metrics, choose Cluster Name.
- Select the cluster that you want to monitor.
Note: If you see spikes in the ActiveControllerCount or OfflinePartitionsCount metric, then one or more brokers are unhealthy. The unhealthy brokers might have caused your cluster to go into the HEALING state.
- To check broker level metrics, under Metrics, choose Broker ID, Cluster Name.
- From the list, select the entries with the cluster name and the metrics, CpuUser and CpuSystem.
- Check if the sum of the CpuUser and CpuSystem values for all the entries reaches an average of 60% or higher for the cluster. If the average is 60% or higher, then high CPU utilization might have caused the cluster to go into the HEALING state. For more information, see Monitor CPU usage.
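The CPU check in the last step can also be scripted. The following is a minimal sketch, assuming a boto3 CloudWatch client and its `get_metric_statistics` call with the `Cluster Name` and `Broker ID` dimensions. The helper names, the one-hour lookback window, and the five-minute period are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

CPU_THRESHOLD_PERCENT = 60.0  # threshold from the step above


def average_metric(cloudwatch, cluster_name, broker_id, metric_name, hours=1):
    """Average one broker-level AWS/Kafka metric over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    points = [point["Average"] for point in response["Datapoints"]]
    # Treat a broker with no datapoints as 0.0 rather than failing.
    return sum(points) / len(points) if points else 0.0


def cluster_cpu_percent(cloudwatch, cluster_name, broker_ids):
    """Average of (CpuUser + CpuSystem) across the given brokers."""
    if not broker_ids:
        raise ValueError("broker_ids must not be empty")
    totals = [
        average_metric(cloudwatch, cluster_name, broker, "CpuUser")
        + average_metric(cloudwatch, cluster_name, broker, "CpuSystem")
        for broker in broker_ids
    ]
    return sum(totals) / len(totals)


def cpu_is_high(cloudwatch, cluster_name, broker_ids):
    """Return True when the cluster average reaches the 60% threshold."""
    return cluster_cpu_percent(cloudwatch, cluster_name, broker_ids) >= CPU_THRESHOLD_PERCENT
```

To use the helpers, pass in a client created with `boto3.client("cloudwatch")`, the cluster name, and the list of broker IDs for your cluster.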
An Amazon MSK cluster might also go into the HEALING state for one of the following reasons:
- Amazon MSK must replace a node or an Amazon Elastic Block Store (Amazon EBS) volume because of a hardware failure.
- A node doesn't meet the Amazon MSK performance SLA for the broker, and Amazon MSK must replace the node for efficient performance.
Amazon MSK is a fully managed service, so brokers have self-managed workflows that perform corrective actions on themselves. For example, when an Amazon EBS volume in a broker becomes unhealthy, Amazon MSK observes the state of the volume for a certain period of time. If the volume becomes healthy during this time, then Amazon MSK takes no action. If the volume continues to be unhealthy after this period, then Amazon MSK automatically replaces the volume. The cluster goes into the HEALING state when Amazon MSK performs these actions. However, the Amazon MSK cluster remains available as long as you follow the best practices.
Your Amazon MSK cluster is in a perpetual HEALING state
Workload on the cluster is high
If the workload on the cluster is high and Amazon MSK continuously replaces the brokers, then your cluster might go into a perpetual HEALING state. To avoid high workload on the cluster, don't use t3.small instances for hosting production clusters. If you use m5 instances, then make sure that you choose the correct size for your cluster. To determine the correct size for your cluster based on your workload, monitor your CPU usage, partition count, and throughput.
Also, make sure that the number of partitions per broker doesn't exceed the recommended value.
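The partition check can be expressed as a small helper. The following is a minimal sketch that assumes you already have the partition count for each broker, for example from the broker-level partition count metric in CloudWatch. The `limit` parameter is a placeholder: take the recommended value for your broker size from the Amazon MSK best practices, because this article doesn't state it. The function name is illustrative.

```python
def brokers_over_partition_limit(partitions_per_broker, limit):
    """Return the sorted broker IDs whose partition count exceeds `limit`.

    partitions_per_broker maps a broker ID to its partition count.
    limit is the recommended per-broker maximum for your broker size
    (an input that you supply, not a value defined here).
    """
    return sorted(
        broker
        for broker, count in partitions_per_broker.items()
        if count > limit
    )
```

An empty result means that no broker exceeds the limit that you supplied.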
The Auto Scaling group can't bring up a new instance
If there's an internal issue such as a missing dependency, then the Auto Scaling group can't bring up a new instance and your cluster goes into a perpetual HEALING state.
For example, you can no longer access the AWS Key Management Service (AWS KMS) key that you specified during cluster creation.
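To check the state of the AWS KMS key from your own credentials, you can describe the key. The following is a minimal sketch, assuming a boto3 KMS client and its `describe_key` call. The helper name `kms_key_is_enabled` is illustrative. This check reflects only what your caller identity can see, so it doesn't prove that Amazon MSK itself can use the key.

```python
def kms_key_is_enabled(kms_client, key_id):
    """Return True when the key's state is 'Enabled'.

    describe_key raises an error if the key doesn't exist or the caller
    isn't allowed to describe it; that error is left to propagate because
    it's also a sign that the key is no longer accessible.
    """
    metadata = kms_client.describe_key(KeyId=key_id)["KeyMetadata"]
    return metadata["KeyState"] == "Enabled"
```

To use the helper, pass in a client created with `boto3.client("kms")` and the key ID or ARN that you specified when you created the cluster.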
An internal event impacts the availability of the EC2 instance
Your cluster might also enter a perpetual HEALING state for one of the following reasons:
- An internal event affects the availability of the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances.
- An internal event causes Amazon EBS latency in an Availability Zone or AWS Region.
If your cluster stays in perpetual HEALING state and it's not a result of high workloads, then contact AWS Support.
Related information
Understand MSK Provisioned cluster states
Welcome to the Amazon MSK Developer Guide
Monitor an Amazon MSK Provisioned cluster
Best practices for Apache Kafka clients