How can I troubleshoot high CPU usage on one or more brokers in an Amazon MSK cluster?
I need to troubleshoot high CPU utilization on one or more brokers in my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.
The total CPU utilization for an Amazon MSK cluster is calculated as the sum of the following:
- Percentage of CPU in user space, defined by the CpuUser metric
- Percentage of CPU in kernel space, defined by the CpuSystem metric
It's a best practice to keep the total CPU utilization to less than 60% so that 40% of your cluster's CPU is available. Apache Kafka can redistribute CPU load across brokers in the cluster as needed. For example, when Amazon MSK recovers from a broker fault, the available CPU can be used to perform automatic maintenance, such as patching.
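The 60% guideline above can be illustrated with a small script. This is a minimal sketch; the per-broker sample values are hypothetical, and only the metric names (CpuUser, CpuSystem) come from the article:

```python
# Sketch: combine CpuUser and CpuSystem per broker and flag brokers
# above the recommended 60% total CPU utilization.
RECOMMENDED_MAX = 60.0

def total_cpu(cpu_user: float, cpu_system: float) -> float:
    """Total CPU utilization as defined for MSK: user space + kernel space."""
    return cpu_user + cpu_system

# Hypothetical per-broker samples, in percent: (CpuUser, CpuSystem).
brokers = {1: (45.0, 10.0), 2: (45.0, 20.0), 3: (30.0, 5.0)}

for broker_id, (user, system) in brokers.items():
    total = total_cpu(user, system)
    status = "over" if total > RECOMMENDED_MAX else "within"
    print(f"broker {broker_id}: {total:.1f}% ({status} the 60% guideline)")
```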
The following are some of the most common causes for high CPU utilization in your Amazon MSK cluster:
- The incoming or outgoing traffic is high.
- The number of partitions per broker exceeds the recommended maximum, overloading the cluster.
- You're using a T type instance.
The incoming or outgoing traffic is high
You can monitor the incoming and outgoing traffic to the cluster with the Amazon CloudWatch metrics BytesInPerSec and BytesOutPerSec. If these metrics for a broker have high values or are skewed, then the broker might be experiencing high CPU usage. The following are some of the causes of high incoming and outgoing traffic:
- The partition count for the topic that gets high traffic isn't spread evenly on the brokers. Or, the producer isn't sending data evenly to all the partitions. Be sure to check your producer partitioning key and update the cluster configuration accordingly. Make sure to configure the partition key in such a way that one partition doesn't get more data than the rest.
- The consumer group is committing offsets very frequently. The traffic from offset commits affects the broker. In such cases, you see a significantly high MessagesInPerSec value for the broker that's the leader for the __consumer_offsets topic partition. This is the partition that the consumer group's offsets are committed to. To resolve this issue, reduce the number of consumer groups or upgrade the size of your instance.
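One way to spot a skewed broker is to compare each broker's BytesInPerSec value against the cluster average. The sketch below is a minimal, hypothetical check over per-broker values you might export from CloudWatch; the 1.5x skew threshold is an assumption for illustration, not an MSK recommendation:

```python
# Sketch: flag brokers whose BytesInPerSec is far above the cluster average.
# The sample values and the 1.5x threshold are hypothetical assumptions.
def skewed_brokers(bytes_in_per_broker: dict[int, float], factor: float = 1.5) -> list[int]:
    """Return broker IDs whose traffic exceeds factor * cluster average."""
    avg = sum(bytes_in_per_broker.values()) / len(bytes_in_per_broker)
    return [b for b, v in bytes_in_per_broker.items() if v > factor * avg]

sample = {1: 9_000_000.0, 2: 2_000_000.0, 3: 1_000_000.0}
print(skewed_brokers(sample))  # broker 1 is well above the cluster average
```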
Number of partitions per broker was exceeded
If the number of partitions per broker exceeds the recommended value, your cluster is overloaded. In this case, you might be prevented from doing the following:
- Update the cluster configuration.
- Update the Apache Kafka version for the cluster.
- Update the cluster to a smaller broker type.
- Associate an AWS Secrets Manager secret with a cluster that has SASL/SCRAM authentication.
Having too many partitions causes performance degradation because of high CPU utilization. This is because each partition uses some amount of broker resources, even when there is little traffic. To address this issue, try the following:
- Delete stale or unused topics to bring the partition count within the recommended limit.
- Scale up the broker instance type to the type that can accommodate the number of partitions that you need. Also, try adding more brokers and reassigning partitions.
Note that partitions aren't automatically reassigned when you add brokers. You must run the kafka-reassign-partitions.sh command. For more information, see Re-assign partitions after changing cluster size.
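To check whether a broker is over the limit, you can count partition replicas per broker from the topic assignments. This is a sketch; the limit value is hypothetical, because the recommended maximum depends on the broker instance type:

```python
# Sketch: count partition replicas per broker and compare against a
# per-broker limit. The limit here is a hypothetical placeholder; the
# actual recommended maximum depends on the broker instance type.
from collections import Counter

def partitions_per_broker(assignments: list[list[int]]) -> Counter:
    """assignments: one replica list per partition, e.g. [1, 3, 2]."""
    counts = Counter()
    for replicas in assignments:
        counts.update(replicas)
    return counts

assignments = [[1, 3, 2], [2, 1, 3], [3, 2, 1], [1, 2, 3]]
limit = 1000  # hypothetical per-broker recommendation
for broker, count in sorted(partitions_per_broker(assignments).items()):
    status = "over" if count > limit else "ok"
    print(f"broker {broker}: {count} partition replicas ({status})")
```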
You're using a T type instance
Note that T type instances provide a baseline level of performance with the ability to burst. These instances allow a baseline performance of 20% CPU utilization. If you exceed this value, then the instance starts to spend CPU credits. When the utilization is less than 20%, CPU credits accrue.
Be sure to monitor the CPUCreditBalance metric in Amazon CloudWatch. Credits are added to the credit balance after they are earned and removed from the credit balance when they are spent. The credit balance has a maximum limit that's determined by the instance size. After this limit is reached, any new credits that are earned are discarded. For the T2 Standard instance type, launch credits don't count towards the limit.
The CPU credits indicated by the CPUCreditBalance metric are available for the instance to spend to burst beyond its baseline CPU utilization. When an instance is running, credits in CPUCreditBalance don't expire. When a T4g, T3a, or T3 instance stops, the CPUCreditBalance value persists for seven days. After seven days, you lose all accrued credits. When a T2 instance stops, the CPUCreditBalance value doesn't persist, and you lose all accrued credits. CPU credit metrics are available at a five-minute frequency.
Make sure that you monitor the baseline CPU usage and credit balance for any cluster that's running on T type instances. If the CPU usage is more than the baseline, and no more credits are left to spend, then the cluster experiences performance issues.
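The earn/spend behavior can be illustrated with a toy simulation. The numbers below (the per-hour earn rate and the maximum balance) are illustrative assumptions, not the documented values for any particular T instance size; check the burstable-instance documentation for your size:

```python
# Toy simulation of T-instance CPU credit accrual and spending.
# Assumptions (illustrative only): 20% baseline CPU, a fixed hourly
# earn rate, 1 credit = 100% of one vCPU for one minute, and a capped balance.
BASELINE_PCT = 20.0
EARN_PER_HOUR = 24.0   # hypothetical earn rate
MAX_BALANCE = 576.0    # hypothetical cap

def step_hour(balance: float, cpu_pct: float) -> float:
    """Advance the credit balance by one hour at a given CPU utilization."""
    balance += EARN_PER_HOUR
    if cpu_pct > BASELINE_PCT:
        # Spending above baseline: 60 minutes at (cpu - baseline)% of a vCPU.
        balance -= (cpu_pct - BASELINE_PCT) / 100.0 * 60.0
    return max(0.0, min(balance, MAX_BALANCE))

balance = 100.0
for hour, cpu in enumerate([10, 80, 80, 80]):
    balance = step_hour(balance, cpu)
    print(f"hour {hour}: cpu {cpu}% -> balance {balance:.1f}")
```

Sustained utilization above the baseline drains the balance faster than credits accrue, which is why a T type cluster that runs hot eventually hits performance issues.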
Other causes of high CPU utilization
- The number of client connections is high: A spike in connection-related CloudWatch metrics, such as ConnectionCount and ConnectionCreationRate, might cause the broker CPU usage to increase. To troubleshoot this issue, monitor these metrics in CloudWatch. Then, reduce the connection count as needed or scale up the broker type.
- The number of consumer groups is high: If the number of consumer groups is high (for example, more than 1000), the CPU usage for the broker might increase. High CPU usage might also result when a consumer group is committing offsets too frequently. To resolve this issue, reduce the number of consumer groups or upgrade the size of your instance.
- Amazon MSK detects and recovers from a broker fault: In this case, Amazon MSK performs an automatic maintenance operation such as patching, resulting in an increased CPU usage.
- A user requests a broker type change or version upgrade: In this case, Amazon MSK deploys rolling workflows that take one broker offline at a time. When brokers with lead partitions go offline, Apache Kafka reassigns partition leadership to redistribute work to other brokers in the cluster. Monitor the CPU usage for these brokers and make sure that you have sufficient CPU headroom in your cluster to tolerate operational events.
- The CPU usage for one or more brokers is high because of skewed data distribution: For example, if two brokers out of six are written to and consumed the most, then they see a higher CPU usage. To address this issue, make sure that you use the round robin technique so that partitions across the cluster are well distributed. Run the topic describe command to see how the partitions are distributed across the cluster. The output might look similar to the following:
bin/kafka-topics.sh --bootstrap-server $MYBROKERS --describe --topic my-topic
Topic:my-topic PartitionCount:7 ReplicationFactor:3 Configs:
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
Topic: my-topic Partition: 1 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
Topic: my-topic Partition: 2 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
Topic: my-topic Partition: 3 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: my-topic Partition: 4 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: my-topic Partition: 5 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
- You turned on open monitoring: If you turned on open monitoring with Prometheus and the scrape interval is low, it might lead to a high number of emitted metrics. This leads to an increase in the CPU usage. To resolve this issue, increase the scrape interval.
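For the skewed-distribution case, you can quantify leader placement by counting how many partition leaders each broker holds in the describe output shown earlier. A minimal parsing sketch, assuming the standard kafka-topics.sh --describe output format:

```python
# Sketch: count partition leaders per broker from kafka-topics.sh --describe
# output. Assumes the standard "Leader: <id>" fields in each partition line.
from collections import Counter
import re

def leader_counts(describe_output: str) -> Counter:
    """Map broker ID -> number of partitions it leads."""
    return Counter(int(m) for m in re.findall(r"Leader:\s*(\d+)", describe_output))

sample = """\
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
Topic: my-topic Partition: 1 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
Topic: my-topic Partition: 2 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
"""
print(leader_counts(sample))  # one leader per broker: evenly distributed
```

A heavily uneven count (for example, one broker leading most partitions) points to the skew described above.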