How do I troubleshoot high CPU usage on one or more brokers in an Amazon MSK cluster?

I want to troubleshoot high CPU utilization on one or more brokers in my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

Resolution

The total CPU utilization for an Amazon MSK cluster is the sum of the following values:

  • Percentage of CPU in user space that's defined by the metric CpuUser
  • Percentage of CPU in kernel space that's defined by CpuSystem

It's a best practice to keep the total CPU utilization under 60% so that 40% of your cluster's CPU is available. Apache Kafka can redistribute CPU load across brokers in the cluster as needed. For example, when a broker fault occurs, Amazon MSK can use the available CPU to perform automatic maintenance, such as patches.
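The headroom check above can be sketched in a few lines. This is an illustrative helper, not part of any AWS API; the sample metric values are made up:

```python
# Minimal sketch: total broker CPU is CpuUser + CpuSystem, and it's a best
# practice to keep the total under 60% so that 40% headroom remains for
# maintenance and fault-recovery events. Sample values are hypothetical.

def cpu_headroom_ok(cpu_user: float, cpu_system: float, limit: float = 60.0) -> bool:
    """Return True when total CPU utilization stays under the limit."""
    return (cpu_user + cpu_system) < limit

print(cpu_headroom_ok(cpu_user=45.0, cpu_system=10.0))  # 55% total -> True
print(cpu_headroom_ok(cpu_user=50.0, cpu_system=15.0))  # 65% total -> False
```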

Your Amazon MSK cluster might have high CPU utilization for one of the following reasons:

  • The incoming or outgoing traffic is high.
  • The number of partitions per broker exceeds the recommended value and overloads the cluster.
  • You use a T instance type.

The incoming or outgoing traffic is high

To monitor the incoming and outgoing traffic to the cluster, use the Amazon CloudWatch metrics BytesInPerSec and BytesOutPerSec. If these metrics have high or skewed values for a broker, then that broker might experience high CPU usage.

Brokers might experience high traffic when high-volume topics have uneven partition distribution, or when the producer doesn't distribute data evenly across all partitions. To resolve this issue, check your producer partitioning key and update the cluster configuration. Configure the partition key so that no single partition receives more data than the rest.
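To illustrate how the partitioning key drives per-partition load, here is a toy sketch. Python's built-in `hash()` stands in for Kafka's actual key hashing (murmur2), and the key names and record counts are hypothetical:

```python
# Sketch: how the choice of partitioning key drives per-partition load.
# hash() is a stand-in for Kafka's real partitioner; keys are hypothetical.
from collections import Counter

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    """Illustrative key-to-partition mapping (not Kafka's exact algorithm)."""
    return hash(key) % NUM_PARTITIONS

# A producer that reuses one hot key sends every record to one partition:
hot = Counter(partition_for("tenant-42") for _ in range(1000))
print(hot)  # a single partition receives all 1000 records

# Keys spread over many distinct values distribute the load:
spread = Counter(partition_for(f"tenant-{i}") for i in range(1000))
print(spread)  # records land across multiple partitions
```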

You might also experience high traffic when consumer groups commit offsets very frequently, because the traffic from offset commits adds load on the broker. To resolve this issue, reduce the offset commit frequency, reduce the number of consumer groups, or scale up to a larger broker instance size.
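One way to lower the commit rate is to lengthen the auto-commit interval on the consumer. The property names below are standard Apache Kafka consumer configs; the values are illustrative only:

```python
# Hypothetical consumer-settings sketch: committing offsets less often
# reduces the commit traffic that each broker must absorb.
consumer_config = {
    "enable.auto.commit": "true",
    # The Apache Kafka default is 5000 ms; raising it lowers the commit rate.
    "auto.commit.interval.ms": "30000",
}

def commits_per_minute(interval_ms: int) -> float:
    """Offset commits per minute for a given auto-commit interval."""
    return 60_000 / interval_ms

print(commits_per_minute(5_000))   # 12.0 commits/min at the default
print(commits_per_minute(30_000))  # 2.0 commits/min with the longer interval
```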

The number of partitions per broker exceeds the recommended value

If the number of partitions per broker exceeds the recommended value, then your cluster is overloaded. When your cluster is overloaded, you can't take the following actions:

  • Update the cluster configuration.
  • Update the Apache Kafka version for the cluster.
  • Update the cluster to a smaller broker type.
  • Associate an AWS Secrets Manager secret with a cluster that has SASL/SCRAM authentication.

When you have too many partitions, you might have high CPU utilization and experience performance degradation.

To resolve this issue, take the following actions:

  • Delete stale or unused topics to bring the partition count within the recommended limit. To identify unused topics, turn on topic-level monitoring, and then check the BytesInPerSec and BytesOutPerSec metrics at the topic level to see whether any traffic flows through the topic. If no traffic flows through a topic, then you can delete it.
  • Scale up the broker instance type to a type that can accommodate the number of partitions that you need. Also, add more brokers and reassign partitions.

Note: You must run the kafka-reassign-partitions command to reassign partitions. Amazon MSK doesn't automatically reassign partitions when you add brokers.
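The kafka-reassign-partitions tool consumes a JSON file that describes the target replica assignment. A minimal sketch of building that file follows; the topic name and broker IDs are hypothetical:

```python
# Sketch: build the JSON file that kafka-reassign-partitions.sh consumes
# via --reassignment-json-file. Topic name and broker IDs are hypothetical.
import json

reassignment = {
    "version": 1,
    "partitions": [
        # Move partitions of "my-topic" onto newly added brokers 4, 5, and 6.
        {"topic": "my-topic", "partition": 0, "replicas": [4, 5, 6]},
        {"topic": "my-topic", "partition": 1, "replicas": [5, 6, 4]},
    ],
}

with open("reassignment.json", "w") as f:
    json.dump(reassignment, f, indent=2)

print(json.dumps(reassignment["partitions"][0]))
```

You would then pass the file to the tool, for example: `bin/kafka-reassign-partitions.sh --bootstrap-server $MYBROKERS --reassignment-json-file reassignment.json --execute`.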

You use a T instance type

T instance types provide a baseline performance with some burstable capability. These instances allow a baseline performance of 20% CPU utilization. If you exceed this value, then the instance spends CPU credits. When utilization is less than 20%, the instance accrues CPU credits.

Make sure to monitor the CPU credit balance metric for burstable instances in Amazon CloudWatch.

Monitor the baseline CPU usage and credit balance for any cluster that runs on T instance types. If CPU usage is more than the baseline, and there are no more credits left to spend, then the cluster experiences performance issues.
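The credit mechanics can be illustrated with a toy model. The 20% baseline comes from the article; the earn rate, spend formula, and workload trace below are simplified assumptions, not the actual EC2 credit accounting:

```python
# Toy model of T-instance CPU credit accrual and spend. The 20% baseline is
# from the article; the earn/spend rates and traces are hypothetical.
BASELINE = 20.0  # percent CPU utilization

def simulate(credits: float, usage_trace: list, earn_per_step: float = 1.0) -> float:
    """Spend credits while usage exceeds the baseline; accrue while below."""
    for usage in usage_trace:
        if usage > BASELINE:
            # Simplified spend: proportional to how far usage exceeds baseline.
            credits -= (usage - BASELINE) / BASELINE * earn_per_step
        else:
            credits += earn_per_step
        credits = max(credits, 0.0)  # balance can't go negative
    return credits

# Sustained 80% usage drains a small balance; the cluster then throttles:
print(simulate(10.0, [80.0] * 10))  # 0.0 -> out of credits
# Quiet periods below the baseline rebuild the balance:
print(simulate(0.0, [10.0] * 5))    # 5.0
```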

Other possible causes

The number of client connections is high

A spike in any of the following Amazon CloudWatch metrics might cause the broker CPU usage to increase:

  • ConnectionCount
  • ConnectionCreationRate
  • ConnectionCloseRate

To troubleshoot this issue, monitor these metrics on Amazon CloudWatch. Then, reduce the connection count as needed or scale up the broker type.
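A simple way to watch for churn is to flag samples of ConnectionCreationRate that exceed a threshold you choose. The sample values and threshold below are hypothetical:

```python
# Sketch: flag a connection-churn spike in a series of metric samples
# (e.g. ConnectionCreationRate). The threshold and values are hypothetical.
def churn_spike(creation_rates: list, threshold: float = 100.0) -> bool:
    """Return True when any sample exceeds the chosen threshold."""
    return any(rate > threshold for rate in creation_rates)

print(churn_spike([20.0, 35.0, 250.0, 40.0]))  # True: a spike at 250/s
print(churn_spike([20.0, 35.0, 40.0]))         # False: steady churn
```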

Amazon MSK detects and recovers from a broker fault

When Amazon MSK detects a broker fault and performs an automatic maintenance operation, such as a patch, CPU usage increases. As soon as Amazon MSK completes the cluster operation, CPU usage returns to normal levels.

A user requests a broker type change or version upgrade

When a user requests a broker type change or version upgrade, Amazon MSK deploys rolling workflows that take one broker offline at a time. When brokers with lead partitions go offline, Apache Kafka reassigns partition leadership to redistribute work to other brokers in the cluster. Monitor CPU usage for these brokers, and make sure that your cluster has sufficient CPU headroom to tolerate operational events.

CPU usage for one or more brokers is high because of skewed data distribution

CPU usage for one or more brokers might be high because of skewed data distribution. For example, if you write to only two brokers out of six, and consumers read mostly from those brokers, then those brokers see higher CPU usage. To address this issue, use a round-robin technique so that partitions are well distributed across the cluster.

To see how the Apache Kafka cluster controller distributes the partitions across the cluster, run the following command:

bin/kafka-topics.sh --bootstrap-server $MYBROKERS --describe --topic my-topic

Example output:

Topic:my-topic    PartitionCount:6 ReplicationFactor:3 Configs:
    Topic: my-topic    Partition: 0 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
    Topic: my-topic    Partition: 1 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
    Topic: my-topic    Partition: 2 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
    Topic: my-topic    Partition: 3 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
    Topic: my-topic    Partition: 4 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
    Topic: my-topic    Partition: 5 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
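To check leadership balance programmatically, you can count the `Leader:` fields in the describe output. A minimal parsing sketch over the example output above:

```python
# Sketch: parse `kafka-topics.sh --describe` output and count partition
# leaders per broker to spot leadership skew. Input is the example above.
from collections import Counter
import re

describe_output = """\
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
Topic: my-topic Partition: 1 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
Topic: my-topic Partition: 2 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
Topic: my-topic Partition: 3 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: my-topic Partition: 4 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: my-topic Partition: 5 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
"""

leaders = Counter(re.findall(r"Leader: (\d+)", describe_output))
print(leaders)  # each of brokers 1, 2, 3 leads 2 partitions -> well balanced
```

If one broker ID dominated the count, that broker would be handling a disproportionate share of produce and fetch traffic.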

You turned on open monitoring

If you turn on open monitoring with Prometheus and the scrape interval is short, then the large number of emitted metrics can increase CPU usage. To resolve this issue, increase the scrape interval. It's a best practice to not exceed 1 scrape per minute per broker to preserve the performance of your Amazon MSK cluster. Many default Prometheus configurations scrape every 10-15 seconds.
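A minimal prometheus.yml fragment that follows the one-scrape-per-minute guidance might look like the following. The job name and target addresses are placeholders for your broker DNS names:

```yaml
# Hypothetical prometheus.yml fragment: one scrape per minute per broker.
# Job name and targets are placeholders; 11001 is the JMX Exporter port
# that Amazon MSK open monitoring exposes.
scrape_configs:
  - job_name: msk-jmx
    scrape_interval: 60s
    static_configs:
      - targets:
          - "broker-1.example.com:11001"
          - "broker-2.example.com:11001"
```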