How do I troubleshoot high latency issues in ElastiCache for Redis?

I want to troubleshoot high latency issues in Amazon ElastiCache for Redis.

Short description

The following are common causes of high latency issues in ElastiCache for Redis:

  • Slow commands
  • Increased swap activity that's caused by high memory usage
  • Network issues
  • Client side latency issues
  • Redis synchronization
  • ElastiCache cluster events

Resolution

Troubleshoot your high latency issues based on the following causes:

Slow commands

Redis processes commands on a single thread. So, when one request is slow to serve, all other clients must wait to be served. This wait increases the total time that it takes to serve requests. To monitor the average latency for different classes of commands, use Amazon CloudWatch metrics for Redis. Note that common Redis operations complete in microseconds, while CloudWatch metrics are sampled every minute and show latency as an aggregate of multiple commands. As a result, a single slow command can cause unexpected results, such as timeouts, without a significant change in the metric graphs.

To retrieve a list of commands that take a long time to complete, use SLOWLOG GET. For example, run the slowlog get 128 command in redis-cli to return up to 128 of the most recent slow log entries. For more information, see SLOWLOG GET on the Redis website. Also, see How do I turn on Redis Slow log in an ElastiCache for Redis cache cluster?
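The slow log can also be read from application code. The following is a minimal sketch, assuming the third-party redis-py client is installed and that its slowlog_get call returns entries with duration (in microseconds) and command fields. The function names fetch_slowlog and slowest_commands and the 10,000 microsecond threshold are illustrative choices, not part of ElastiCache.

```python
from __future__ import annotations

from typing import Iterable


def fetch_slowlog(host: str, port: int = 6379, count: int = 128) -> list[dict]:
    """Return up to `count` slow log entries from one node (needs redis-py)."""
    import redis  # third-party; imported lazily so the helper below works without it

    client = redis.Redis(host=host, port=port, socket_timeout=5)
    return client.slowlog_get(count)


def slowest_commands(
    entries: Iterable[dict], min_duration_us: int = 10_000
) -> list[tuple[int, str]]:
    """Keep entries at or above the threshold, slowest first."""
    rows = []
    for entry in entries:
        duration = int(entry["duration"])  # microseconds
        if duration < min_duration_us:
            continue
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        rows.append((duration, command))
    return sorted(rows, key=lambda row: row[0], reverse=True)
```

A typical call might be slowest_commands(fetch_slowlog("<node_endpoint>")), run against each node in turn, because every node keeps its own slow log.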

If there's an increase in the EngineCPUUtilization metric, then see How do I troubleshoot increased CPU usage in my ElastiCache for Redis self-designed cluster?

The following are examples of commands that can be slow to run:

  • KEYS commands that run in production environments over large datasets. KEYS sweeps the entire keyspace to search for the specified pattern. For more information, see KEYS on the Redis website.
  • Lua scripts that take a long time to run. For more information, see Scripting with Lua on the Redis website.
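One common way to avoid a full KEYS sweep is to iterate with the Redis SCAN command, which returns the keyspace in small batches so that other clients are served between calls. The following is a sketch under that assumption, not a method that this article prescribes. It expects a client object that exposes a redis-py style scan(cursor, match, count) method. The function name scan_keys and the batch size are hypothetical.

```python
from typing import Iterator


def scan_keys(client, pattern: str, batch_size: int = 1000) -> Iterator:
    """Yield keys that match `pattern` by using incremental SCAN calls.

    `client` is assumed to behave like a redis-py client, where
    scan() returns a (next_cursor, keys) pair and a cursor of 0
    means that the iteration is complete.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        yield from keys
        if cursor == 0:
            break
```

SCAN can return the same key more than once, so callers that need unique keys might collect the results in a set.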

Increased swap activity that's caused by high memory usage

When there's increased memory pressure on the cluster, memory pages are swapped to disk. The transfer of memory pages to and from the swap area might cause latency to increase and timeouts to occur. The following CloudWatch metrics indicate increased swap activity:

  • Increased SwapUsage
  • Very low FreeableMemory
  • High BytesUsedForCache and DatabaseMemoryUsagePercentage metrics
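The indicators above can be combined into a simple check. The following sketch takes metric values that you have already read from CloudWatch and reports which indicators are present. The function name and all three thresholds are hypothetical placeholders, so tune them for your node type. BytesUsedForCache is left out because its meaning depends on the node's total memory.

```python
def memory_pressure_signals(
    swap_usage_bytes: float,
    freeable_memory_bytes: float,
    database_memory_usage_pct: float,
    *,
    swap_limit_bytes: float = 50 * 1024**2,      # assumed threshold
    freeable_floor_bytes: float = 100 * 1024**2,  # assumed threshold
    usage_limit_pct: float = 80.0,                # assumed threshold
) -> list:
    """Return the swap-related indicators that the given values trigger."""
    signals = []
    if swap_usage_bytes > swap_limit_bytes:
        signals.append("SwapUsage is high")
    if freeable_memory_bytes < freeable_floor_bytes:
        signals.append("FreeableMemory is low")
    if database_memory_usage_pct > usage_limit_pct:
        signals.append("DatabaseMemoryUsagePercentage is high")
    return signals
```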

To troubleshoot your increased swap activity, see the following resources:

Network issues

To troubleshoot high latency issues that are caused by network issues, see the following scenarios:

Network latency between the client and the ElastiCache cluster
To isolate network latency between the client and cluster nodes, see How do I troubleshoot network performance issues between EC2 Linux or Windows instances in a VPC and an on-premises host over the internet gateway?

The cluster reaches network limits
An ElastiCache node shares the same network limits as the related Amazon Elastic Compute Cloud (Amazon EC2) instances. For example, the node type of cache.m6g.large has the same network limits as the m6g.large Amazon EC2 instance. For information on troubleshooting your ElastiCache node network limits, see Network-related limits. Also, it's a best practice to monitor network performance for your Amazon EC2 instance and check your bandwidth capability, packet-per-second (PPS) performance, and connections tracked.

TCP/SSL handshake latency

Clients use a TCP connection to connect to Redis clusters. TCP connections can take a few milliseconds to be created. This delay adds overhead to your application's Redis operations and puts additional pressure on the CPU of the ElastiCache node. Make sure that you control the volume of new connections, especially when your cluster uses the ElastiCache in-transit encryption (TLS) feature. A high volume of connections that are opened (NewConnections) and closed might affect the node's performance. For a large number of connections, use connection pooling to cache established TCP connections in a pool. To implement connection pooling, use your Redis client library (if supported), or manually build your connection pool. To reduce the number of round trips, you can also use aggregated commands such as MSET/MGET or Redis pipelines. For more information, see Redis pipelining on the Redis website.
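Many Redis client libraries include a connection pool, so prefer the library's pool when it's available. For the manual option, the following is a minimal sketch of a fixed-size pool that reuses established connections instead of opening a new one for each operation. The class name SimplePool and the factory callable are hypothetical. The factory stands in for whatever call creates a connection in your client library.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Fixed-size pool that creates connections lazily and reuses them."""

    def __init__(self, factory, size: int):
        self._factory = factory
        # LIFO order hands out the most recently used connection first.
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(None)  # None marks a slot with no connection yet

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Raises queue.Empty if every slot stays in use for `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            if conn is None:
                conn = self._factory()
            yield conn
        except Exception:
            # Discard a connection that might be in an unknown state.
            if conn is not None and hasattr(conn, "close"):
                conn.close()
            conn = None
            raise
        finally:
            self._idle.put(conn)
```

Because the pool size caps the number of open connections per client process, it also bounds how much each process contributes to the node's connection count.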

There are a large number of connections on the ElastiCache node

It's a best practice to track the CurrConnections and NewConnections CloudWatch metrics. These metrics monitor the number of TCP connections that are accepted by Redis. A large number of TCP connections might lead to the exhaustion of the 65,000 maxclients limit. This limit is the maximum concurrent connections that you can have per node. For more information, see Maximum concurrent connected clients on the Redis website. When you reach the 65,000 limit, the ERR max number of clients reached error appears. If more connections are added beyond the Linux server limit or maximum number of connections tracked, then connection timeouts occur. For more information on how to prevent a large number of connections, see Best practices with Redis clients.

Client side latency issues

To determine if client side resources cause latency issues, check the memory, CPU, and network utilization on the client side. Make sure that these resources aren't near their limits. If your application runs on an Amazon EC2 instance, then use CloudWatch metrics to identify issues. Also, use a monitoring tool inside the Amazon EC2 instance, such as atop or CloudWatch agent.

If the timeout configuration values set up on the application are too small, then you might receive unnecessary timeout errors. To resolve these errors, configure the client side timeout to allow the server enough time to process requests and generate responses. For more information, see Best practices with Redis clients. Also, timeout errors show additional information. Make sure that you review the timeout error details to isolate the cause of your latency. Check for the following patterns to determine whether latency is caused by the client side, the ElastiCache node, or the network:

  • Check whether the timeouts occur frequently or at a specific time of the day.
  • Check whether the timeouts occur at a specific client or multiple clients.
  • Check whether the timeouts occur at a specific Redis node or multiple nodes.
  • Check whether the timeouts occur at a specific cluster or multiple clusters.
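If your application logs its timeout errors, the checks above can be approximated by counting timeouts per client, per node, and per hour of the day. The following sketch assumes a hypothetical record format with client, node, and time (ISO 8601 string) fields. Adapt the field names to whatever your logs contain.

```python
from collections import Counter
from datetime import datetime


def timeout_patterns(events) -> dict:
    """Count timeout events by client, by node, and by hour of the day."""
    by_client, by_node, by_hour = Counter(), Counter(), Counter()
    for event in events:
        by_client[event["client"]] += 1
        by_node[event["node"]] += 1
        by_hour[datetime.fromisoformat(event["time"]).hour] += 1
    return {"client": by_client, "node": by_node, "hour": by_hour}
```

A count that concentrates on one client points toward client-side resources, while a count that concentrates on one node points toward that node or its network path.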

Redis synchronization

Redis synchronization starts during backup, node replacement, and scaling events. Synchronization is a compute-intensive workload that can cause latency. To check if synchronization is in progress, use the SaveInProgress CloudWatch metric. For more information, see How synchronization and backup are implemented.
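The SaveInProgress metric can also be queried programmatically. The following is a sketch, assuming the third-party boto3 SDK with configured credentials and a node identified by the CacheClusterId dimension. The function names, the 60-minute window, and the choice of the Maximum statistic are illustrative.

```python
from datetime import datetime, timedelta, timezone


def save_in_progress_query(cluster_id: str, minutes: int = 60, now=None) -> dict:
    """Build get_metric_statistics parameters for SaveInProgress."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "SaveInProgress",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Maximum"],
    }


def sync_timestamps(datapoints) -> list:
    """Return the timestamps of datapoints where a save was in progress."""
    return sorted(p["Timestamp"] for p in datapoints if p["Maximum"] >= 1)


def fetch_sync_timestamps(cluster_id: str) -> list:
    import boto3  # third-party; imported lazily so the helpers above work without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**save_in_progress_query(cluster_id))
    return sync_timestamps(response["Datapoints"])
```

Comparing the returned timestamps with the times of your latency spikes can show whether the two line up.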

ElastiCache cluster events

To check for events that occurred during the time period of the latency, view the Events section in the ElastiCache console. Check for background activities, such as node replacement or failover events, that could be caused by ElastiCache managed maintenance and service updates. Also, check for unexpected hardware failures. Scheduled event notifications are sent through the AWS Health Dashboard and email.

Example event log:

Finished recovery for cache nodes 0001
Recovering cache nodes 0001
Failover from master node <cluster_node> to replica node <cluster_node> completed

Related information

Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch

Additional troubleshooting steps: https://kcs.support.aws.dev/article/elasticache-redis-correct-high-latency/2

AWS OFFICIAL
Updated 5 months ago