Why is my Amazon MSK cluster going into the HEALING state?

I want to troubleshoot my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster that's in the HEALING state.


Your Amazon MSK cluster goes into the HEALING state when the service is running an internal operation to address an issue, for example when brokers are unresponsive. While the cluster is in this state, you can still use it to produce and consume data. However, you can't perform Amazon MSK API or AWS Command Line Interface (AWS CLI) update operations on the cluster until it returns to the ACTIVE state.
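As a minimal sketch of how you might check the state programmatically before you attempt an update operation, the following Python helpers read the State field from the Amazon MSK DescribeCluster response. The function names are hypothetical, and the client is passed in as a parameter on the assumption that you create it with boto3.client("kafka") in an environment that has AWS credentials.

```python
from typing import Any


def get_cluster_state(kafka_client: Any, cluster_arn: str) -> str:
    """Return the cluster state string, for example 'ACTIVE' or 'HEALING'."""
    response = kafka_client.describe_cluster(ClusterArn=cluster_arn)
    return response["ClusterInfo"]["State"]


def can_run_update_operations(state: str) -> bool:
    """Per the text above, update operations wait until the cluster is ACTIVE."""
    return state == "ACTIVE"


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   state = get_cluster_state(boto3.client("kafka"), "<your-cluster-arn>")
#   if not can_run_update_operations(state):
#       print(f"Cluster is {state}; retry the update later.")
```

Passing the client in, instead of creating it inside the function, keeps the helpers easy to exercise with a stub object.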

Use the Amazon CloudWatch metrics for Amazon MSK to see why the cluster is in the HEALING state:

  1. Open the CloudWatch console.
  2. In the navigation pane, choose Metrics, and then choose All metrics.
  3. On the Browse tab, choose AWS/Kafka.
  4. Under Metrics, choose Cluster Name.
  5. Select the cluster that you want to monitor.
    Spikes in the ActiveControllerCount or OfflinePartitionsCount metric indicate that one or more brokers are unhealthy, which might have caused your cluster to go into the HEALING state.
  6. For broker-level metrics, choose Broker ID, Cluster Name under Metrics.
  7. From the list, select the entries with the cluster name and the metrics CpuUser and CpuSystem. Check whether the sum of these two values, averaged across all the entries for the cluster, is higher than 60%. If so, high CPU utilization might have caused the cluster to go into the HEALING state. For more information on monitoring CPU usage, see Best practices - Monitor CPU usage.
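The CPU check in step 7 can also be sketched in Python. This is an illustration under stated assumptions, not the only way to read the guidance: it interprets the check as "add CpuUser and CpuSystem per broker, then average across brokers", and the function names, the 3-hour window, and the 5-minute period are hypothetical choices. The client is assumed to come from boto3.client("cloudwatch").

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Any, Dict, Iterable

CPU_THRESHOLD_PERCENT = 60.0  # threshold quoted in the steps above


def average_metric(cloudwatch: Any, cluster_name: str, broker_id: str,
                   metric_name: str, hours: int = 3) -> float:
    """Average one broker-level AWS/Kafka metric over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": broker_id},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # assumption: 5-minute datapoints
        Statistics=["Average"],
    )
    points = [point["Average"] for point in response["Datapoints"]]
    return mean(points) if points else 0.0


def broker_cpu_totals(cloudwatch: Any, cluster_name: str,
                      broker_ids: Iterable[str], hours: int = 3) -> Dict[str, float]:
    """Return {broker_id: average CpuUser + average CpuSystem}."""
    return {
        broker_id: average_metric(cloudwatch, cluster_name, broker_id, "CpuUser", hours)
        + average_metric(cloudwatch, cluster_name, broker_id, "CpuSystem", hours)
        for broker_id in broker_ids
    }


def cluster_cpu_is_high(totals: Dict[str, float],
                        threshold: float = CPU_THRESHOLD_PERCENT) -> bool:
    """True when the per-broker totals average above the threshold."""
    return bool(totals) and mean(list(totals.values())) > threshold


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   totals = broker_cpu_totals(boto3.client("cloudwatch"), "<cluster-name>", ["1", "2", "3"])
#   print(totals, cluster_cpu_is_high(totals))
```

Returning the per-broker totals as well as the cluster-wide verdict makes it easier to spot a single overloaded broker that a cluster average would hide.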

The following are the common reasons for an Amazon MSK cluster to go into the HEALING state:

  • A node or an Amazon Elastic Block Store (Amazon EBS) volume must be replaced because of a hardware failure.
  • A node doesn't meet the Amazon MSK performance SLA for the broker, and the node must be replaced for optimal performance.

Because Amazon MSK is a fully managed service, brokers have self-managed workflows that perform corrective actions on themselves, such as replacing nodes after a failure. When an Amazon EBS volume in a broker becomes unhealthy, Amazon MSK observes the state of the volume for a certain period of time. If the volume becomes healthy during this period, then Amazon MSK takes no action. If the volume is still unhealthy after this period, then Amazon MSK automatically replaces the volume. The cluster goes into the HEALING state while Amazon MSK performs these actions. However, this doesn't affect the availability of the Amazon MSK cluster as long as you follow the best practices. Even when the cluster is in the HEALING state, it can handle requests from producers and consumers.

In rare cases, your cluster might enter a perpetual HEALING state. The following are possible causes:

  • The workload on the cluster is high, and the brokers are being continuously replaced. To avoid this issue, it's a best practice not to use t3.small instances to host production clusters. If you're using m5 instances, then make sure that you chose the right size for your cluster. You can determine the size for your cluster based on your workload and by monitoring your CPU usage. Also, make sure that the number of partitions per broker doesn't exceed the recommended value.
  • The Auto Scaling group can't bring up a new instance. This might happen because of an internal issue or a missing dependency. For example, the AWS Key Management Service (AWS KMS) key that was specified during cluster creation might no longer be accessible.
  • A rare internal event impacted the availability of the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances or caused Amazon EBS latency in an Availability Zone or AWS Region.
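For the AWS KMS dependency mentioned in the list above, a rough first check might look like the following Python sketch. It reads the data volume key ID from the DescribeCluster response and then asks AWS KMS whether that key is in the Enabled state. The function names are hypothetical, and the clients are assumed to come from boto3.client("kafka") and boto3.client("kms"). Note the limitation: the check runs with your own credentials, so a passing result doesn't prove that Amazon MSK itself can still use the key.

```python
from typing import Any, Optional


def data_volume_kms_key_id(kafka_client: Any, cluster_arn: str) -> Optional[str]:
    """Return the KMS key ID used for encryption at rest, if the response has one."""
    info = kafka_client.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    return (
        info.get("EncryptionInfo", {})
        .get("EncryptionAtRest", {})
        .get("DataVolumeKMSKeyId")
    )


def kms_key_is_enabled(kms_client: Any, key_id: str) -> bool:
    """True when DescribeKey succeeds and reports KeyState 'Enabled'."""
    try:
        metadata = kms_client.describe_key(KeyId=key_id)["KeyMetadata"]
    except Exception:
        # Assumption: any failure (key not found, access denied, and so on)
        # is treated as "not usable". Narrow this in real code, for example
        # to botocore.exceptions.ClientError.
        return False
    return metadata.get("KeyState") == "Enabled"


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   key_id = data_volume_kms_key_id(boto3.client("kafka"), "<your-cluster-arn>")
#   if key_id and not kms_key_is_enabled(boto3.client("kms"), key_id):
#       print("The cluster's KMS key is disabled, deleted, or not accessible.")
```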

If your cluster stays in a perpetual HEALING state that isn't load-induced, then contact AWS Support.

Related information

Cluster states

AWS OFFICIAL Updated a year ago