How do I troubleshoot high latency issues in ElastiCache for Valkey or ElastiCache for Redis OSS?
I want to troubleshoot high latency issues in my Amazon ElastiCache for Valkey or Amazon ElastiCache for Redis OSS cluster.
Short description
The following are common causes of high latency issues in ElastiCache for Valkey or ElastiCache for Redis OSS clusters:
- Slow commands
- Increased swap activity from high memory usage
- Network issues
- Client-side latency issues
- Valkey or Redis OSS synchronization
- Amazon ElastiCache cluster events
Resolution
Slow commands
Because Valkey and Redis OSS process commands on a single thread, ElastiCache can't serve other clients until the current request completes. A single slow command therefore increases the total time for all queued requests and causes high latency.
To monitor the average latency of specific command types, use Amazon CloudWatch metrics for Valkey and Redis OSS. For more information, see Metrics for Valkey and Redis OSS.
To retrieve a list of commands that took longer than 10 ms for the engine to process, use the SLOWLOG GET command. Connect to the affected node and run the slowlog get 128 command in the valkey-cli.
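If you prefer to query the slow log from application code instead of the valkey-cli, the following is a minimal sketch with the redis-py client. The node endpoint is a placeholder assumption:

```python
import redis

# Placeholder endpoint; replace with your node's endpoint.
# For clusters with in-transit encryption, also pass ssl=True.
r = redis.Redis(host="my-node.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

# Fetch up to 128 slow log entries. An entry records any command that
# exceeded the slowlog-log-slower-than threshold (10 ms by default).
for entry in r.slowlog_get(128):
    # The duration field is in microseconds.
    print(entry["id"], entry["duration"], entry["command"])
```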
Also, ElastiCache measures the latency of common Valkey and Redis OSS operations in microseconds. CloudWatch samples metrics every minute and shows latency metrics as an aggregate of multiple commands. As a result, a single command can cause issues, such as timeouts, without significant changes in the metric graphs.
Slow commands that take a long time to complete can cause increased CPU usage on the ElastiCache node. If there's an increase in the EngineCPUUtilization metric, then see How do I troubleshoot increased CPU usage in my ElastiCache for Redis self-designed cluster?
The following are examples of complex commands that can slow ElastiCache clusters:
- The use of the KEYS command in production environments with large datasets: The KEYS command sweeps the entire keyspace and searches for specified patterns. Use the incremental SCAN command instead, as shown in the sketch after this list. For more information, see KEYS on the Valkey website.
- Lua scripts that take a long time to run: Depending on how complex a script is or how large a dataset is, Lua scripts can run long and cause latency issues.
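To illustrate the KEYS issue, the following hedged sketch (redis-py, with a hypothetical endpoint and key pattern) replaces a blocking KEYS call with incremental SCAN iteration:

```python
import redis

r = redis.Redis(host="my-node.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

# Avoid in production: KEYS blocks the single-threaded engine while it
# sweeps the entire keyspace.
# keys = r.keys("session:*")

# Prefer SCAN: it iterates the keyspace in small batches, so other
# clients' commands can run between iterations.
for key in r.scan_iter(match="session:*", count=100):
    print(key)
```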
Increased swap activity from high memory usage
When there's increased memory pressure on the cluster, the operating system swaps memory pages to disk. Because memory pages are transferred to and from the swap area, swapping can increase latency and cause timeouts. The following changes in CloudWatch metrics are signs that there's an increase in swap activity:
- Increased SwapUsage
- Low FreeableMemory
- High BytesUsedForCache and DatabaseMemoryUsagePercentage metrics
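To check these metrics programmatically, you can use a sketch like the following with boto3. The Region, cluster ID, and node ID are placeholder assumptions:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

now = datetime.datetime.now(datetime.timezone.utc)

# Pull the last hour of SwapUsage for one node. The cluster and node IDs
# are placeholders; replace them with your own values.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="SwapUsage",
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "my-cluster-0001"},
        {"Name": "CacheNodeId", "Value": "0001"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "bytes")
```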
To troubleshoot increased swap activity, review the following articles:
- How do I resolve the increase in swap activity in my ElastiCache instances?
- How do I check memory usage in my ElastiCache for Redis self-designed cluster and implement best practices to control high memory usage?
Network issues
Network issues can cause high latency for your clusters. Based on your network issue, complete the following tasks to troubleshoot high latency issues.
Network latency between the client and the ElastiCache cluster
To reduce latency between the client and your ElastiCache cluster, first isolate how much of the latency comes from the network between the client and the cluster nodes. For more information, see How do I troubleshoot network performance issues between EC2 Linux or Windows instances in a virtual private cloud (VPC) and an on-premises host over the internet gateway?
The cluster reaches network limits
An ElastiCache node shares the same network limits as the related Amazon Elastic Compute Cloud (Amazon EC2) instances. For example, the network limits for the ElastiCache cache.m6g.large node type and the Amazon EC2 m6g.large instance are the same. For more information about supported ElastiCache node types and network bandwidth limits, see Supported node types.
To troubleshoot ElastiCache node network limits, see Network-related limits.
Note: It's a best practice to monitor your Amazon EC2 instance network performance, bandwidth capability, packet-per-second (PPS) performance, and connections tracked.
TCP/SSL handshake latency
When a client connects to the cluster, the TCP handshake, and the TLS handshake if in-transit encryption is turned on, can take a few milliseconds to complete. When you have many new connections, the handshake work puts additional strain on your Valkey or Redis OSS operations and your ElastiCache node CPU, and can cause high latency.
To control the volume of your connections and reduce latency, use a connection pool to cache and reuse established TCP connections. To configure a connection pool, use your Redis client library. Or, you can manually build your connection pool.
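For example, the following is a minimal sketch with the redis-py client. The endpoint and pool size are placeholder assumptions:

```python
import redis

# Create the pool once at application startup and share it across the
# application. max_connections caps the TCP connections to the node.
pool = redis.ConnectionPool(
    host="my-node.xxxxxx.0001.use1.cache.amazonaws.com",  # placeholder endpoint
    port=6379,
    max_connections=50,
)

# Clients that share the pool reuse established connections instead of
# paying the TCP (and TLS) handshake cost on every request.
r = redis.Redis(connection_pool=pool)
r.set("greeting", "hello")
print(r.get("greeting"))
```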
To reduce the number of network round trips, you can also use aggregated commands, such as MSET or MGET, or pipelines. For more information, see Redis pipelining on the Redis website.
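The following sketch, again with redis-py and hypothetical keys, shows both approaches:

```python
import redis

r = redis.Redis(host="my-node.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

# Aggregated commands: one round trip for several keys.
r.mset({"user:1": "alice", "user:2": "bob"})
print(r.mget(["user:1", "user:2"]))

# Pipelining: queue commands client side and send them in one batch.
pipe = r.pipeline(transaction=False)
pipe.incr("page:views")
pipe.expire("page:views", 3600)
pipe.get("page:views")
print(pipe.execute())
```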
There are a large number of connections on the ElastiCache node
If there are a large number of TCP connections on an ElastiCache node, then you might exhaust the maxclients limit. When you reach this limit, you get an "ERR max number of clients reached" error, and you can experience connection timeouts.
To reduce high latency, it's a best practice to track the CurrConnections and NewConnections CloudWatch metrics. You can monitor these metrics to see the number of TCP connections that your ElastiCache node has. To resolve issues when you exhaust your maxclients limit, see the Large number of connections section of Best practices: Redis clients and Amazon ElastiCache for Redis.
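As one way to track these metrics, the following sketch creates a CloudWatch alarm on CurrConnections with boto3. The alarm name, threshold, cluster ID, and SNS topic ARN are hypothetical examples:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

# Alarm when a node holds an unusually high number of TCP connections for
# five consecutive minutes. The names, threshold, and SNS topic ARN are
# hypothetical examples.
cloudwatch.put_metric_alarm(
    AlarmName="elasticache-high-currconnections",
    Namespace="AWS/ElastiCache",
    MetricName="CurrConnections",
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "my-cluster-0001"},
        {"Name": "CacheNodeId", "Value": "0001"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=5000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```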
Client-side latency issues
If you configure client resources with timeout values that are too low, then you might receive timeout errors. To determine whether client resources cause latency issues, check the memory, CPU, and network utilization on the client side. If these resources are near their limits, then increase the client-side timeout values to give the server time to respond.
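For example, with the redis-py client you can raise the socket timeouts. The endpoint and timeout values below are illustrative assumptions, not recommendations:

```python
import redis

# socket_connect_timeout bounds the TCP handshake; socket_timeout bounds
# each command's response. The endpoint and values here are illustrative.
r = redis.Redis(
    host="my-node.xxxxxx.0001.use1.cache.amazonaws.com",  # placeholder endpoint
    port=6379,
    socket_connect_timeout=5,
    socket_timeout=10,
    retry_on_timeout=True,
)
```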
If your application runs on an Amazon EC2 instance, then you can use CloudWatch metrics to further identify issues. Or, use a monitoring tool inside the Amazon EC2 instance, such as atop or the CloudWatch agent.
To determine if the client is the cause of high latency, look for the following issues:
- Check whether the timeouts occur frequently or only at specific times of the day.
- Check whether the timeouts occur on a specific client or on multiple clients.
- Check whether the timeouts occur on a specific Valkey or Redis OSS node or on multiple nodes.
- Check whether the timeouts occur on a specific cluster or on multiple clusters.
Valkey or Redis OSS synchronization
Synchronization initiates during backup, node replacement, and scaling events. This process is a compute-intensive workload that can cause increased latency.
To check whether a synchronization affected your node performance, you can check the SaveInProgress metric in CloudWatch.
Note: To minimize the effects on user traffic, it's a best practice to schedule synchronization events during off-peak hours.
ElastiCache cluster events
If your ElastiCache cluster has a cluster event, then you might experience high latency during the event. You can use the ElastiCache console to review events that occurred during the latency period. Check for background activities, such as node replacement or failover events from ElastiCache managed maintenance and service updates.
If you think that unexpected hardware failures caused high latency, then contact AWS Support.
Note: You can view scheduled Event Notifications in the AWS Health Dashboard.
Example event log:
Finished recovery for cache nodes 0001
Recovering cache nodes 0001
Failover from master node cluster_node to replica node cluster_node completed
Related information
Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch
How do I turn on log delivery in an ElastiCache for Redis OSS or ElastiCache for Valkey cluster?