How do I troubleshoot increased CPU usage in my ElastiCache for Redis self-designed cluster?


I want to troubleshoot increased CPU usage in my Amazon ElastiCache for Redis self-designed cluster.

Short description

The following are Amazon CloudWatch CPU metrics for ElastiCache for Redis:

  • EngineCPUUtilization: Reports the CPU utilization of the Redis engine thread. Because Redis is single-threaded, it's a best practice to monitor the EngineCPUUtilization metric for nodes with four or more vCPUs.
  • CPUUtilization: Reports the CPU utilization percentage for the whole host. For smaller nodes with two or fewer vCPUs, use the CPUUtilization metric to monitor the cluster workload.
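
The guidance above can be sketched as a small helper that picks the metric to alarm on from a node's vCPU count. This is a minimal illustration, not an AWS API: the function name cpu_metric_for_node is hypothetical, and treating every node with more than two vCPUs as an EngineCPUUtilization node is an assumption, because the guidance only names "two or fewer" and "four or more".

```python
def cpu_metric_for_node(vcpus: int) -> str:
    """Return the CloudWatch CPU metric to watch for a node with `vcpus` vCPUs.

    Hypothetical helper. Assumption: any node with more than two vCPUs is
    treated like the "four or more" case.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be a positive integer")
    # Small nodes: host-level CPU is the better signal.
    if vcpus <= 2:
        return "CPUUtilization"
    # Larger nodes: the single Redis engine thread can saturate while
    # host-level CPU still looks low, so watch the engine thread.
    return "EngineCPUUtilization"
```

For example, cpu_metric_for_node(2) selects CPUUtilization and cpu_metric_for_node(8) selects EngineCPUUtilization.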

Resolution

Troubleshoot high EngineCPUUtilization

To troubleshoot high EngineCPUUtilization, check for the following:

  • Long-running commands that consume high CPU time: Commands with high time complexity, such as KEYS, HKEYS, and HGETALL, consume high CPU time. To check a command's time complexity and performance suggestions, see Commands on the Redis website. The EVAL and EVALSHA Redis commands run Lua scripts. While a Lua script runs, all other server activity is blocked and EngineCPUUtilization increases. For more information, see Scripting with Lua, EVAL, and EVALSHA on the Redis website. To check for long-running commands or Lua scripts, use Redis SLOWLOG.
  • A high number of requests: Check the command statistics to identify command bursts or increased latency. To check command statistics, use CloudWatch metrics such as GetTypeCmds or HashBasedCmds. Or, use the Redis INFO command. For more information, see INFO on the Redis website. If you have a high number of requests and your application workload is as expected, then scale the cluster.
  • Backup and replication: If backup or replication occurred, then check the SaveInProgress metric. This binary metric shows 1 when a background save (forked or forkless) is in progress and shows 0 when a background save isn't in progress. Make sure that there is enough memory to create a Redis snapshot.
  • High number of NewConnections: A high number of new client connection requests in a short time period might cause an increase in EngineCPUUtilization. For best practices when you handle a large number of connections, see Best practices: Redis clients and Amazon ElastiCache for Redis. Redis versions 6.2 and newer include performance improvements. For more information, see ElastiCache for Redis 6.2 (enhanced).
  • High number of key evictions: Redis evicts keys based on the maxmemory-policy parameter. Evictions occur when the cache doesn't have enough memory to hold new data. If the eviction volume is high, then Redis uses more CPU resources to evict the keys, and EngineCPUUtilization increases. To monitor the eviction volume, use the CloudWatch metric Evictions. If the eviction volume is high, then use a larger node type or add more nodes to scale your cluster.
  • High number of reclaimed keys: To free up memory, Redis samples and deletes any keys that reached their timeout expiration. This process is called reclaim. If there's a high number of expirations, then CPUUtilization and EngineCPUUtilization might increase. To monitor the number of key expiration events, use the CloudWatch metric Reclaimed. It's a best practice to make sure that too many keys don't expire at the same time. To make sure that your keys expire in different time windows, use the EXPIREAT command, or set different TTL values for your keys. For more information, see EXPIREAT on the Redis website.
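
Two of the checks above lend themselves to a short sketch: ranking commands by total CPU time from the commandstats section of INFO, and spreading key expirations with randomized TTLs. The helper names rank_commands_by_cpu and jittered_ttl are hypothetical, and the sample text assumes the usual cmdstat_<name>:calls=...,usec=...,usec_per_call=... line format. The sketch uses only the Python standard library and works on text that you already retrieved, for example with INFO commandstats in redis-cli.

```python
import random


def rank_commands_by_cpu(commandstats_text):
    """Parse INFO commandstats text into (command, total_usec, calls) tuples,
    sorted so that the command with the most cumulative CPU time comes first."""
    rows = []
    for line in commandstats_text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue  # skip the "# Commandstats" header and blank lines
        name, _, fields = line.partition(":")
        stats = dict(pair.split("=", 1) for pair in fields.split(","))
        command = name[len("cmdstat_"):]
        rows.append((command, int(stats["usec"]), int(stats["calls"])))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def jittered_ttl(base_seconds, spread_seconds, rng=random):
    """Return a TTL between base_seconds and base_seconds + spread_seconds
    (inclusive), so that keys written together don't all expire together."""
    return base_seconds + rng.randint(0, spread_seconds)


if __name__ == "__main__":
    # Illustrative sample, not output from a real cluster.
    sample = (
        "# Commandstats\r\n"
        "cmdstat_get:calls=1000,usec=2000,usec_per_call=2.00\r\n"
        "cmdstat_hgetall:calls=50,usec=90000,usec_per_call=1800.00\r\n"
    )
    for command, usec, calls in rank_commands_by_cpu(sample):
        print(command, usec, calls)
    print(jittered_ttl(3600, 300))
```

In the sample, HGETALL has far fewer calls than GET but much more total CPU time, which is the pattern to look for when EngineCPUUtilization is high. The spread of 300 seconds is an arbitrary illustration. Choose a value that suits how stale your data can be.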

Troubleshoot high CPUUtilization

To troubleshoot high CPUUtilization, check for the following:

  • High network traffic or connections: High network traffic or a high number of connections might lead to increased CPUUtilization on ElastiCache for Redis. To check for high network traffic or connections, check the NewConnections, NetworkBytesIn, NetworkBytesOut, NetworkPacketsIn, and NetworkPacketsOut CloudWatch metrics.
  • Asynchronous I/O that's handled by other threads: For supported node types, enhanced I/O is designed to handle network I/O on dedicated threads. Also, for Redis version 6.2 and newer, TLS offloading is supported and allows ElastiCache for Redis to perform TLS operations on the I/O threads. TLS operations use the extra CPU cores available on the node, and this extra CPU use might result in increased CPUUtilization. For more information, see Amazon ElastiCache performance boost with Amazon EC2 M5 and R5 instances.
  • Continuous managed maintenance and service updates: Maintenance and service updates require compute capacity and might result in an increase in CPUUtilization. Check whether the increase in CPUUtilization coincides with the maintenance window. It's a best practice to set the maintenance window to a time period of low usage. For more information, see the Amazon ElastiCache managed maintenance and service updates help page.
  • High paging and operations: Insufficient memory on the node might cause the kernel to page out memory to swap. This action can lead to performance degradation. If the paging is excessive, then CPUUtilization might increase. Also, if the load on the node is high when operations such as backup or scaling occur, then CPUUtilization might increase. For more information, see Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch.
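
To compare a CPUUtilization spike with the maintenance window, the check can be sketched as follows. This is a minimal illustration that assumes the window is expressed in the ddd:hh24:mi-ddd:hh24:mi form in UTC (for example, sun:23:00-mon:01:30). The function names are hypothetical, and the spike timestamp is something you read from CloudWatch yourself.

```python
from datetime import datetime, timezone
from typing import Tuple

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def _minute_of_week(day: str, hour: int, minute: int) -> int:
    """Minutes elapsed since Monday 00:00."""
    return DAYS.index(day) * 24 * 60 + hour * 60 + minute


def parse_window(window: str) -> Tuple[int, int]:
    """Convert 'sun:23:00-mon:01:30' to (start, end) minutes of the week."""
    def to_minutes(part: str) -> int:
        day, hour, minute = part.split(":")
        return _minute_of_week(day, int(hour), int(minute))

    start, end = window.lower().split("-")
    return to_minutes(start), to_minutes(end)


def in_maintenance_window(spike: datetime, window: str) -> bool:
    """Return True if a timezone-aware timestamp falls inside the window."""
    start, end = parse_window(window)
    utc = spike.astimezone(timezone.utc)  # pass timezone-aware datetimes
    now = _minute_of_week(DAYS[utc.weekday()], utc.hour, utc.minute)
    if start <= end:
        return start <= now < end
    # The window wraps past the end of the week (for example, sun to mon).
    return now >= start or now < end
```

If the spike falls inside the window, maintenance or a service update is a plausible cause. If it doesn't, continue with the other checks in this section.
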
AWS OFFICIAL. Updated 10 months ago