I want to troubleshoot the high heap memory usage on my Amazon MSK (Amazon Managed Streaming for Apache Kafka) cluster.
Resolution
Monitor the HeapMemoryAfterGC metric
Use Amazon CloudWatch to monitor the HeapMemoryAfterGC metric, which shows the percentage of total heap memory that remains in use after garbage collection completes. Monitor HeapMemoryAfterGC instead of the MemoryFree or MemoryUsed metrics.
It's a best practice to keep HeapMemoryAfterGC below 60%. When HeapMemoryAfterGC exceeds 60%, your Apache Kafka cluster might experience performance degradation. Create an Amazon CloudWatch alarm that triggers when HeapMemoryAfterGC exceeds 60% and another CloudWatch alarm for when it reaches the 80% threshold.
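The two alarms described above can be sketched as parameter sets for the CloudWatch PutMetricAlarm API. This is a minimal sketch, not a definitive setup: the cluster name, broker ID, statistic, period, and evaluation periods are illustrative assumptions, and the code only builds the parameters without calling AWS.

```python
def build_heap_alarm(cluster_name: str, broker_id: str, threshold: float,
                     comparison: str = "GreaterThanThreshold") -> dict:
    """Build keyword arguments for a HeapMemoryAfterGC alarm (nothing is sent to AWS)."""
    return {
        "AlarmName": f"{cluster_name}-broker-{broker_id}-heap-after-gc-{int(threshold)}",
        "Namespace": "AWS/Kafka",              # CloudWatch namespace for Amazon MSK
        "MetricName": "HeapMemoryAfterGC",
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": broker_id},
        ],
        "Statistic": "Maximum",                # assumption: alarm on the worst sample
        "Period": 300,                         # assumption: 5-minute periods
        "EvaluationPeriods": 3,                # assumption: 3 consecutive breaches
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }


# "demo-cluster" and broker "1" are hypothetical names used for illustration.
alarms = [
    build_heap_alarm("demo-cluster", "1", 60.0),
    build_heap_alarm("demo-cluster", "1", 80.0, "GreaterThanOrEqualToThreshold"),
]
for alarm in alarms:
    print(alarm["AlarmName"], alarm["ComparisonOperator"], alarm["Threshold"])
```

If you use the AWS SDK for Python (Boto3), you can pass each dictionary to the CloudWatch client as put_metric_alarm(**alarm). Add alarm actions, such as an Amazon SNS topic, for your own notification setup.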
Reduce client connections
A high number of connections or many short-lived connections can cause high heap memory usage in an Amazon MSK cluster. To improve performance, close unnecessary active client connections and reduce the overall number of connections where possible.
Monitor the connections-count, connection-close-rate, and connection-creation-rate metrics. If the three metrics and HeapMemoryAfterGC are high, then reduce the connection count. For more information, see the Client connections section in Apache Kafka client performance.
Manage consumer groups
Frequent consumer offset commits or a high number of consumer groups can cause high heap memory usage. To resolve this issue, remove unused consumer groups.
Also, decrease the offsets.retention.minutes value. The default value is 7 days (10080 minutes). For more information, see offsets.retention.minutes on the Kafka website.
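The conversion from days to the minutes that offsets.retention.minutes expects can be sketched as follows. The 3-day target is a hypothetical example, not a recommendation. Choose a value that exceeds the longest period that your consumers stay idle.

```python
from datetime import timedelta


def retention_minutes(days: float) -> int:
    """Convert a retention period in days to whole minutes."""
    return int(timedelta(days=days).total_seconds() // 60)


default_minutes = retention_minutes(7)   # the default of 7 days
reduced_minutes = retention_minutes(3)   # hypothetical lower value

print(f"default: offsets.retention.minutes={default_minutes}")
print(f"reduced: offsets.retention.minutes={reduced_minutes}")
```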
Improve message size and partitions
If partitions are unevenly distributed across brokers or you exceed the recommended number of partitions for each broker in a cluster, then you might experience high heap memory usage.
To resolve this issue, take the following actions:
- If your message.max.bytes value is high, then reduce it. The default value is 1 MB. Significantly larger values, such as 16 MB, can cause increased memory usage.
- Use the appropriate number of partitions for each broker.
- Balance partition distribution across brokers.
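To check the last two actions, you can count the partition replicas that each broker hosts. The following is a minimal sketch that assumes you already exported the replica assignment, for example from the output of a describe-topics tool. The topic names, broker IDs, and the per-broker limit are hypothetical values for illustration.

```python
from collections import Counter


def partitions_per_broker(assignment: dict) -> Counter:
    """Count the partition replicas on each broker.

    assignment maps a topic name to a list of replica lists, one per partition.
    """
    counts = Counter()
    for replica_lists in assignment.values():
        for replicas in replica_lists:
            counts.update(replicas)
    return counts


def find_overloaded(counts: Counter, limit: int) -> list:
    """Return the broker IDs that host more replicas than the limit."""
    return sorted(broker for broker, n in counts.items() if n > limit)


def imbalance_ratio(counts: Counter) -> float:
    """Ratio of the busiest broker to the least busy broker (1.0 is balanced)."""
    return max(counts.values()) / min(counts.values())


# Hypothetical assignment: two topics spread across brokers 1, 2, and 3.
assignment = {
    "orders": [[1, 2], [1, 3], [1, 2]],
    "clicks": [[1, 2], [2, 3]],
}
counts = partitions_per_broker(assignment)
print(dict(counts))
print("overloaded:", find_overloaded(counts, limit=3))
print("imbalance ratio:", imbalance_ratio(counts))
```

A ratio well above 1.0 suggests that a partition reassignment might help. Set the limit based on the guidance for your broker instance type.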
Manage transactional messages
Too many producer IDs or an accumulation of producer state entries can cause high heap memory usage.
If your cluster uses transactional message delivery, then decrease the transactional.id.expiration.ms value. For Kafka versions earlier than 3.4.0, reduce the value from 604800000 ms (7 days) to 86400000 ms (1 day).
For Kafka versions 3.4.0 and later, producer.id.expiration.ms controls the producer expiration. By default, producer.id.expiration.ms is set to 1 day. For more information, see Separate configuration for producer ID expiry on the Confluence website.
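The version rule above can be sketched as a small helper that returns the setting that governs producer expiration for a given Kafka version. The version parsing is an assumption: it reads only the leading numeric parts, so strings with non-numeric suffixes are treated as the base version.

```python
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # default transactional.id.expiration.ms
ONE_DAY_MS = 24 * 60 * 60 * 1000         # 1 day in milliseconds


def parse_version(version: str) -> tuple:
    """Parse the leading numeric parts of a version string into a 3-tuple."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break                        # stop at suffixes such as "x" or "tiered"
        parts.append(int(part))
    parts = parts[:3]
    return tuple(parts + [0] * (3 - len(parts)))


def producer_expiry_setting(kafka_version: str) -> str:
    """Return the setting that controls producer expiration for this version."""
    if parse_version(kafka_version) >= (3, 4, 0):
        return "producer.id.expiration.ms"
    return "transactional.id.expiration.ms"


for version in ("2.8.1", "3.3.2", "3.4.0", "3.6.0"):
    print(version, "->", producer_expiry_setting(version))
print("7 days in ms:", SEVEN_DAYS_MS, "| 1 day in ms:", ONE_DAY_MS)
```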
Review your broker instance type and scaling
High inbound or outbound traffic can cause high heap memory usage. Monitor the BytesInPerSec, BytesOutPerSec, ReplicationBytesInPerSec, and ReplicationBytesOutPerSec metrics. If these four metrics and HeapMemoryAfterGC are high, then reduce the traffic or move to a larger broker size for more memory.
Improve the segment configuration
Low segment.ms and segment.bytes values can cause too many memory-mapped files.
When the segment.ms value is low, Kafka creates new log segments more frequently, which results in too many open file handles. When the cluster reaches the maximum open file quota, heap memory usage increases and brokers become unstable.
To resolve this issue, increase the segment.ms value across topics. The default segment.ms value is 604800000 ms (7 days). A log.segment.bytes value of 1073741824 bytes (1 GB) is appropriate for most scenarios. However, modify the values for your retention and compaction requirements. For more information, see segment.ms and log.segment.bytes on the Kafka website.
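To see why a low segment.ms value multiplies the number of files, you can estimate the segment count with simple arithmetic. This is a rough sketch under stated assumptions: every partition receives data continuously, only time-based rolling occurs (the size limit is never reached), and each segment consists of three files (.log, .index, and .timeindex). The partition count and the 1-hour segment.ms value are hypothetical.

```python
import math

FILES_PER_SEGMENT = 3  # simplification: .log, .index, and .timeindex


def segments_per_partition(retention_ms: int, segment_ms: int) -> int:
    """Estimate how many time-rolled segments a partition keeps within retention."""
    return math.ceil(retention_ms / segment_ms)


def estimated_files(partitions: int, retention_ms: int, segment_ms: int) -> int:
    """Estimate the segment files on disk for a number of partitions."""
    return partitions * segments_per_partition(retention_ms, segment_ms) * FILES_PER_SEGMENT


SEVEN_DAYS_MS = 604_800_000
ONE_HOUR_MS = 3_600_000

# Hypothetical workload: 1,000 partitions with 7 days of retention.
with_default = estimated_files(1000, SEVEN_DAYS_MS, SEVEN_DAYS_MS)
with_low_value = estimated_files(1000, SEVEN_DAYS_MS, ONE_HOUR_MS)
print("segment.ms = 7 days:", with_default, "files")
print("segment.ms = 1 hour:", with_low_value, "files")
```

In this estimate, the 1-hour setting produces 168 times as many files as the default for the same retention period.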
Further troubleshoot
Take the following actions:
- Check your server and cluster configurations and modify them for improved performance and reliability.
- If you activated Tiered Storage for your clusters, then expect 60-70% heap usage when you read data from the remote tier. Monitor the RemoteFetchBytesPerSec metric with HeapMemoryAfterGC.
- If you use Prometheus to monitor your cluster, then set your scrape interval for the Prometheus host configuration to 60 seconds or higher. Longer intervals reduce how often the brokers collect and serve metrics.
- T3 brokers can use CPU credits to temporarily burst performance. If you exceed the baseline, then memory usage increases. You can update to a larger instance type.
- If you still experience issues, then reboot the broker or create a support interaction to engage with AWS Support.
Related information
Memory running low
Monitor disk space
Using Amazon CloudWatch alarms
Best practices for Apache Kafka clients