I see high memory usage or low free memory values in my Amazon MSK cluster, and I want to understand these memory metrics.
Resolution
High memory usage or low free memory values are expected behavior for Apache Kafka. Kafka uses as much available memory as possible to improve performance through disk caching. Because of this design, high MemoryUsed or low MemoryFree values don't always indicate a problem.
Understand memory metrics
To understand Kafka's memory usage, review the following Kafka memory metrics:
- Use the MemoryFree metric to measure the unused memory on the host that sits idle and isn't allocated for any purpose.
- Use the MemoryCached metric to measure the memory that the operating system uses to cache disk contents. This reduces disk access latency and increases I/O performance.
- Use the MemoryBuffered metric to measure the memory that the operating system uses to buffer I/O operations to disk.
- Use the MemoryUsed metric to measure the total amount of memory that is currently in use. It's calculated as Total memory - (MemoryFree + MemoryCached + MemoryBuffered).
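The MemoryUsed calculation above can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical and chosen only for illustration. The sketch assumes that all inputs use the same unit (MiB here).

```python
def memory_used(total, free, cached, buffered):
    """Return memory in use: total - (free + cached + buffered).

    All arguments must use the same unit (for example, bytes or MiB).
    """
    used = total - (free + cached + buffered)
    if used < 0:
        # Inconsistent inputs: the parts add up to more than the total.
        raise ValueError("free + cached + buffered exceeds total memory")
    return used


# Hypothetical broker with 32 GiB of RAM, values in MiB.
# Most of the memory that isn't free is held by the disk cache.
total_mib = 32768
free_mib = 512
cached_mib = 24576
buffered_mib = 256

used_mib = memory_used(total_mib, free_mib, cached_mib, buffered_mib)
print(used_mib)  # 7424
```

In this hypothetical case, free memory is under 2% of the total, but most of the remainder is disk cache rather than memory that applications hold.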
Monitor critical health metrics
Because Kafka uses high memory to optimize workloads, don't rely on the MemoryUsed metric alone to assess cluster health.
To monitor critical health metrics, take the following actions:
- Use the HeapMemoryAfterGC metric to monitor the actual memory used for Kafka operations after garbage collection. It's a best practice to create a CloudWatch alarm to trigger when the HeapMemoryAfterGC metric exceeds the 60% threshold.
- Monitor the ActiveControllerCount metric. Make sure that only one controller is active per cluster.
- Use the CpuUser metric to monitor the percentage of CPU in user space.
- Use the CpuSystem metric to monitor the percentage of CPU in kernel space. It's a best practice to keep CPU usage (CpuUser + CpuSystem) below 60%. Create a CloudWatch alarm for sustained periods above a 60% threshold.
- Use the KafkaDataLogsDiskUsed metric to monitor the percentage of disk space used for data logs. Create a CloudWatch alarm to trigger at an 85% threshold to prevent disk space exhaustion.
For more information, see Default level monitoring.
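As one possible way to express the HeapMemoryAfterGC alarm recommendation, the following Python sketch builds a set of alarm parameters as a plain dictionary. The function name, alarm name format, statistic, period, and number of evaluation periods are assumptions for illustration and aren't taken from this article. The sketch also assumes the AWS/Kafka namespace and the "Cluster Name" and "Broker ID" dimensions that Amazon MSK uses for broker-level metrics.

```python
def heap_alarm_params(cluster_name, broker_id, threshold_pct=60.0):
    """Build CloudWatch alarm parameters for HeapMemoryAfterGC on one broker.

    The returned dictionary uses the parameter names of the CloudWatch
    PutMetricAlarm operation. Statistic, Period, and EvaluationPeriods
    are illustrative assumptions; tune them for your workload.
    """
    return {
        "AlarmName": f"{cluster_name}-broker-{broker_id}-HeapMemoryAfterGC",
        "Namespace": "AWS/Kafka",
        "MetricName": "HeapMemoryAfterGC",
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "Statistic": "Maximum",       # assumption
        "Period": 300,                # assumption: 5-minute periods
        "EvaluationPeriods": 3,       # assumption: 15 minutes sustained
        "Threshold": threshold_pct,   # 60% per the recommendation above
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "demo-cluster" is a hypothetical cluster name.
params = heap_alarm_params("demo-cluster", 1)
print(params["AlarmName"])  # demo-cluster-broker-1-HeapMemoryAfterGC
```

If you use the AWS SDK for Python (Boto3), a dictionary like this could be passed as keyword arguments to the CloudWatch client's put_metric_alarm call. Add alarm actions, such as an Amazon SNS topic, as needed.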
The alarms help you proactively identify and address performance issues before they affect your cluster. If these recommended metrics are within their recommended ranges, then your cluster is likely healthy, even with high memory usage.
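The thresholds in this section can be combined into a simple check, sketched below in Python. The function and parameter names are hypothetical, and the sketch assumes that you already retrieved the latest metric values (for example, from CloudWatch). It also assumes that the controller count is the sum across all brokers in the cluster.

```python
def check_cluster_health(heap_after_gc_pct, active_controllers,
                         cpu_user_pct, cpu_system_pct, data_logs_disk_pct):
    """Return a list of findings based on the thresholds in this article.

    An empty list means that every checked metric is within its
    recommended range. Memory usage is deliberately not checked.
    """
    findings = []
    if heap_after_gc_pct > 60:
        findings.append("Heap memory after GC is above 60%")
    if active_controllers != 1:
        findings.append("Cluster doesn't have exactly one active controller")
    if cpu_user_pct + cpu_system_pct > 60:
        findings.append("Combined user and system CPU is above 60%")
    if data_logs_disk_pct >= 85:
        findings.append("Data logs disk usage is at or above 85%")
    return findings


# Hypothetical sample values for a cluster with high memory usage
# but otherwise normal metrics.
print(check_cluster_health(35, 1, 20, 10, 40))  # []
```

The check ignores MemoryUsed and MemoryFree on purpose, because high memory usage alone doesn't indicate a problem.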
Related information
Memory running low
Monitor Apache Kafka memory
Monitor disk space
View Amazon MSK metrics using CloudWatch
Using Amazon CloudWatch alarms