I'm seeing high or increasing CPU usage in my Amazon ElastiCache for Redis cluster. How can I troubleshoot this?
There are two Amazon CloudWatch CPU metrics for ElastiCache for Redis:
- EngineCPUUtilization: This metric reports CPU utilization of the Redis engine thread. Because Redis is single-threaded, it's a best practice to monitor the EngineCPUUtilization metric for nodes with four or more vCPUs.
- CPUUtilization: This metric shows the percentage of CPU utilization for the entire host. For smaller nodes with two vCPUs or fewer, use the CPUUtilization metric to monitor the cluster workload.
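The metric choice above can be sketched as a small helper. This is a minimal illustration, not an official tool: the function names `metric_for_node` and `cloudwatch_query` are hypothetical, and the returned dictionary is only assumed to be passed as keyword arguments to the boto3 CloudWatch client's `get_metric_statistics` call. Node types with exactly three vCPUs aren't covered by the guidance, so the sketch treats anything below four vCPUs as a smaller node.

```python
from datetime import datetime, timedelta, timezone


def metric_for_node(vcpus):
    """Pick the CPU metric to watch, following the vCPU guidance above."""
    # Four or more vCPUs: watch the Redis engine thread.
    # Fewer than four (assumption for the unspecified case of three): watch the host.
    return "EngineCPUUtilization" if vcpus >= 4 else "CPUUtilization"


def cloudwatch_query(cluster_id, vcpus, end=None, hours=3):
    """Build query parameters for the chosen metric (hypothetical helper)."""
    end = end if end is not None else datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric_for_node(vcpus),
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }
```

With boto3 installed and credentials configured, the result could be used as `boto3.client("cloudwatch").get_metric_statistics(**cloudwatch_query("my-cluster-001", 8))`, where `my-cluster-001` is a placeholder cluster ID.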
The following are common reasons for high EngineCPUUtilization:
- A long-running command that consumes high CPU time: Commands with high time complexity, such as keys, hkeys, and hgetall, consume more CPU time. For time complexity and performance suggestions for each command, see Commands on the redis.io website. Lua scripts (run by the EVAL or EVALSHA Redis commands) run atomically in Redis. All server activity is blocked for the entire run time of a Lua script, which causes high EngineCPUUtilization. To check for long-running commands or long-running Lua scripts, use the Redis slow log.
- A high number of requests: Check the command statistics to determine if there are command bursts, or if latency is increasing. You can check command statistics using CloudWatch metrics such as GetTypeCmds or HashBasedCmds. Or, you can use the Redis command info commandstats. If you see a high number of requests due to the expected workload on the application, consider scaling the cluster.
- Backup and replication: Check the SaveInProgress metric to see if backup or replication is occurring. This binary metric returns "1" when a background save (forked or forkless) is in progress. The metric returns "0" if a background save isn't in progress. Make sure that you have enough memory to create a Redis snapshot.
- High number of NewConnections: Establishing a TCP connection is a computationally expensive operation, especially for TLS-enabled clusters. A high number of new client connection requests in a short time period might cause an increase in EngineCPUUtilization. Performance improvements for TLS-enabled clusters using x86 node types with eight vCPUs or more, or Graviton2 node types with four vCPUs or more, have been implemented since Redis 6.2. For recommendations on handling a large number of connections, see Best practices: Redis clients and Amazon ElastiCache for Redis.
- High number of evictions: Redis evicts keys according to the maxmemory-policy parameter. Eviction happens when the cache doesn't have enough memory to hold new data. If eviction volume is high, then EngineCPUUtilization increases because Redis is busy evicting keys. You can monitor eviction volume using the CloudWatch metric Evictions. If eviction volume is high, then scale your cluster up by using a larger node type, or scale out by adding more nodes.
- High number of reclaims: To free up memory, Redis samples and then deletes any keys that have reached their timeout expiration. This process is called "reclaim." If there is a high number of expirations, then EngineCPUUtilization increases because Redis is busy reclaiming keys. You can monitor the number of key expiration events using the CloudWatch metric Reclaimed. It's a best practice to avoid expiring too many keys at the same time. This can happen when, for example, you run the EXPIREAT Redis command with the same timestamp on many keys.
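One common way to avoid expiring many keys at the same moment is to add random jitter to each key's time to live. The following is a minimal sketch of that idea, not something the guidance above prescribes: the function name `jittered_ttl` and the default spread of 10 percent are assumptions chosen for illustration.

```python
import random


def jittered_ttl(base_seconds, spread=0.1, rng=None):
    """Return base_seconds shifted by a random amount within +/- spread.

    spread=0.1 means the result falls within 10 percent of base_seconds.
    rng is an optional random.Random instance, useful for repeatable tests.
    """
    rng = rng if rng is not None else random
    offset = rng.uniform(-spread, spread) * base_seconds
    # Never return a TTL below one second.
    return max(1, round(base_seconds + offset))
```

The returned value can then be used as the expiry when writing a key, for example as the `ex` argument of `set` in the redis-py client, so that keys written together don't all expire in the same second.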
For more information on troubleshooting high EngineCPUUtilization, see Troubleshooting connections - CPU usage.
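To act on the command statistics check described in the list above, the text returned by info commandstats can be parsed and ranked by cumulative CPU time. The sketch below assumes the usual `cmdstat_<command>:calls=...,usec=...,usec_per_call=...` line format. The helper names `parse_commandstats` and `top_by_cpu_time` are hypothetical, and the sample values are made up for illustration.

```python
def parse_commandstats(info_text):
    """Parse raw INFO commandstats text into {command: {field: value}}."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue  # skip the section header and blank lines
        name, _, fields = line.partition(":")
        command = name[len("cmdstat_"):]
        parsed = {}
        for pair in fields.split(","):
            key, _, value = pair.partition("=")
            if key and value:
                parsed[key] = float(value)
        stats[command] = parsed
    return stats


def top_by_cpu_time(stats, limit=5):
    """Return (command, total_usec) pairs, highest total CPU time first."""
    ranked = sorted(
        stats.items(),
        key=lambda item: item[1].get("usec", 0.0),
        reverse=True,
    )
    return [(command, fields.get("usec", 0.0)) for command, fields in ranked[:limit]]


# Illustrative sample only; real output comes from the Redis server.
SAMPLE = """# Commandstats
cmdstat_get:calls=1000,usec=2500,usec_per_call=2.50
cmdstat_keys:calls=3,usec=900000,usec_per_call=300000.00
cmdstat_hgetall:calls=50,usec=40000,usec_per_call=800.00
"""

if __name__ == "__main__":
    for command, usec in top_by_cpu_time(parse_commandstats(SAMPLE)):
        print(command, usec)
```

Because the counters are cumulative since the last restart or statistics reset, comparing two snapshots taken a few minutes apart gives a clearer picture of a current burst than a single reading.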
The following are common reasons for high CPUUtilization:
- High network traffic or connections: Check the NewConnections, NetworkBytesIn, NetworkBytesOut, NetworkPacketsIn, and NetworkPacketsOut CloudWatch metrics.
- High EngineCPUUtilization and asynchronous I/O that's handled by other threads: For details on enhanced I/O handling, see Amazon ElastiCache performance boost with Amazon EC2 M5 and R5 instances.
- Continuous managed maintenance and service updates: Maintenance and service updates need compute capacity. As a result, you might notice a spike in CPUUtilization during these events. Check the maintenance window to see if the spike coincides with the window. It's a best practice to set the maintenance window at the time of lowest usage to minimize the impact. For more information, see the Amazon ElastiCache managed maintenance and service updates help page.
- High paging and operations such as backup: Insufficient memory on the node can cause the kernel to page memory out to swap. If the paging is excessive, you might see an increase in CPUUtilization. Similarly, if the load on the node is high during operations such as backup or scaling, you might see an increase in CPUUtilization. For recommendations on metrics to identify the cause of a spike, see Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch.
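The maintenance window check described above can be sketched as follows. This sketch assumes the weekly window is expressed in UTC in the `ddd:hh24:mi-ddd:hh24:mi` form (for example, `sun:05:00-sun:09:00`) and that the spike timestamp is a timezone-aware datetime. The function name `in_maintenance_window` is hypothetical.

```python
from datetime import datetime, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
MINUTES_PER_DAY = 24 * 60


def _minute_of_week(day, hhmm):
    """Convert a day name and HH:MM string to minutes since Monday 00:00."""
    hours, minutes = hhmm.split(":")
    return DAYS.index(day.lower()) * MINUTES_PER_DAY + int(hours) * 60 + int(minutes)


def in_maintenance_window(spike_time, window):
    """Return True if a timezone-aware spike_time falls inside the weekly window."""
    start_text, end_text = window.split("-")
    start_day, start_clock = start_text.split(":", 1)
    end_day, end_clock = end_text.split(":", 1)
    start = _minute_of_week(start_day, start_clock)
    end = _minute_of_week(end_day, end_clock)

    spike = spike_time.astimezone(timezone.utc)
    now = spike.weekday() * MINUTES_PER_DAY + spike.hour * 60 + spike.minute

    if start <= end:
        return start <= now < end
    # The window wraps past the end of the week (for example, sun to mon).
    return now >= start or now < end
```

For example, `in_maintenance_window(datetime(2024, 1, 7, 6, 0, tzinfo=timezone.utc), "sun:05:00-sun:09:00")` evaluates to `True`, because January 7, 2024 is a Sunday and 06:00 UTC falls inside the window.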