How do I understand and troubleshoot memory metrics and usage in Amazon MSK clusters?

I see high-memory usage or low free memory values in my Amazon MSK cluster, and I want to understand these memory metrics.

Resolution

It's normal for Apache Kafka brokers to show high memory usage or low free memory. Kafka uses as much available memory as possible to improve performance through disk caching. Because of this design, high MemoryUsed or low MemoryFree values don't always indicate a problem.

Understand memory metrics

To understand Kafka's memory usage, review the following Kafka memory metrics:

  • Use the MemoryFree metric to measure the memory on the host that is idle and isn't allocated for any purpose.
  • Use the MemoryCached metric to measure the memory that the operating system uses to cache disk contents. This reduces disk access latency and increases I/O performance.
  • Use the MemoryBuffered metric to measure the memory that the operating system uses to buffer I/O operations to disk.
  • Use the MemoryUsed metric to measure the total amount of memory that is currently in use. It's calculated as Total memory - (MemoryFree + MemoryCached + MemoryBuffered).
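
As a rough illustration of the MemoryUsed formula above, the following Python sketch computes it for one broker. The function name and the sample values (a 32 GiB host with most memory in the page cache) are hypothetical and chosen only to show how a low MemoryFree value can coexist with moderate MemoryUsed.

```python
def memory_used(total, free, cached, buffered):
    """Return MemoryUsed per the formula above; all arguments share one unit."""
    return total - (free + cached + buffered)


GIB = 1024 ** 3

# Hypothetical readings for a broker with 32 GiB of RAM (assumed values).
total = 32 * GIB
free = 1 * GIB       # MemoryFree
cached = 20 * GIB    # MemoryCached (page cache)
buffered = 1 * GIB   # MemoryBuffered

used = memory_used(total, free, cached, buffered)
used_pct = 100 * used / total
free_pct = 100 * free / total

print(f"MemoryUsed: {used / GIB:.1f} GiB ({used_pct:.2f}% of total)")
print(f"MemoryFree: {free / GIB:.1f} GiB ({free_pct:.2f}% of total)")
```

In this made-up case only a small share of memory is free, yet most of the remainder is cache that the operating system can reclaim, which is why MemoryFree alone is a poor health signal.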

Monitor critical health metrics

Because Kafka uses a large amount of memory to optimize workloads, don't rely on the MemoryUsed metric alone to assess cluster health.

To monitor critical health metrics, take the following actions:

  • Use the HeapMemoryAfterGC metric to monitor the percentage of heap memory that Kafka still uses after garbage collection. It's a best practice to create a CloudWatch alarm that triggers when HeapMemoryAfterGC exceeds 60%.
  • Monitor the ActiveControllerCount metric. Make sure that only one controller is active per cluster.
  • Use the CpuUser metric to monitor the percentage of CPU in user space.
  • Use the CpuSystem metric to monitor the percentage of CPU in kernel space. It's a best practice to keep CPU usage (CpuUser + CpuSystem) below 60%. Create a CloudWatch alarm for sustained periods above a 60% threshold.
  • Use the KafkaDataLogsDiskUsed metric to monitor the percentage of disk space used for data logs. Create a CloudWatch alarm that triggers at an 85% threshold to prevent disk space exhaustion.

For more information, see Default level monitoring.
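
The thresholds in the preceding list can be sketched as a small check over metric values that you have already retrieved. This is a minimal sketch, not an AWS-provided tool: the function names, the dictionary layout, and the sample readings are assumptions, and the keys use the CloudWatch metric names (the disk metric appears as KafkaDataLogsDiskUsed in CloudWatch).

```python
# Thresholds taken from the recommendations above (percent values).
HEAP_AFTER_GC_MAX_PCT = 60.0
CPU_MAX_PCT = 60.0
DATA_LOGS_DISK_MAX_PCT = 85.0


def broker_findings(metrics):
    """Return a list of threshold breaches for one broker's metric readings."""
    findings = []
    if metrics["HeapMemoryAfterGC"] > HEAP_AFTER_GC_MAX_PCT:
        findings.append("HeapMemoryAfterGC above 60%")
    # CPU guidance applies to user and system time combined.
    if metrics["CpuUser"] + metrics["CpuSystem"] > CPU_MAX_PCT:
        findings.append("CpuUser + CpuSystem above 60%")
    if metrics["KafkaDataLogsDiskUsed"] >= DATA_LOGS_DISK_MAX_PCT:
        findings.append("KafkaDataLogsDiskUsed at or above 85%")
    return findings


def controller_findings(active_controller_counts):
    """Check that exactly one broker in the cluster reports an active controller."""
    total = sum(active_controller_counts)
    return [] if total == 1 else [f"ActiveControllerCount sums to {total}, expected 1"]


# Hypothetical readings for one broker and a three-broker cluster.
sample = {
    "HeapMemoryAfterGC": 72.0,
    "CpuUser": 35.0,
    "CpuSystem": 10.0,
    "KafkaDataLogsDiskUsed": 40.0,
}
print(broker_findings(sample))
print(controller_findings([1, 0, 0]))
```

Note that memory usage is deliberately absent from the check: in this sketch a broker with high MemoryUsed still passes as long as heap, CPU, disk, and controller metrics stay within range.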

The alarms help you proactively identify and address performance issues before the issues impact your cluster. If these recommended metrics are within normal ranges, then your cluster is healthy, even with high memory usage.
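
One way to define such an alarm is sketched below for HeapMemoryAfterGC. The helper only builds the parameter dictionary for the CloudWatch PutMetricAlarm operation and doesn't call AWS. The helper name, alarm name format, cluster name, evaluation window (three 5-minute periods), and the choice of the Maximum statistic are assumptions to adjust for your workload. The AWS/Kafka namespace and the "Cluster Name" and "Broker ID" dimensions are the ones Amazon MSK uses for broker-level metrics, but verify them against the metrics shown in your account.

```python
def heap_alarm_params(cluster_name, broker_id, threshold_pct=60.0):
    """Build keyword arguments for a CloudWatch alarm on HeapMemoryAfterGC."""
    return {
        "AlarmName": f"{cluster_name}-broker-{broker_id}-HeapMemoryAfterGC",
        "Namespace": "AWS/Kafka",
        "MetricName": "HeapMemoryAfterGC",
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "Statistic": "Maximum",
        "Period": 300,            # seconds per evaluation period (assumed)
        "EvaluationPeriods": 3,   # require a sustained breach (assumed)
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "example-cluster" is a placeholder cluster name.
params = heap_alarm_params("example-cluster", 1)
print(params["AlarmName"])

# With boto3 installed and credentials configured, the dictionary could be
# passed to CloudWatch like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same pattern could be reused for the CPU and disk alarms by changing the metric name and threshold.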

Related information

Memory running low

Monitor Apache Kafka memory

Monitor disk space

View Amazon MSK metrics using CloudWatch

Using Amazon CloudWatch alarms