How do I minimize downtime when my ElastiCache for Redis cluster is scaling?

I want to minimize downtime when my Amazon ElastiCache for Redis cluster is scaling.


To help minimize downtime, review these actions and check your setup and maintenance procedures:

  • To minimize downtime during synchronization, avoid scaling when the cluster has a high workload. If the cluster has a high workload and scaling is taking a long time, then reduce the incoming requests to Redis to prevent synchronization failure. To determine when synchronization occurred, check the SaveInProgress metric in Amazon CloudWatch. Note that the SaveInProgress metric collects data every minute and might not capture a synchronization that finished in under one minute. For more information, see Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch.
  • To identify issues caused by client-side misconfiguration when you connect to the cluster, test scaling in a non-production environment. Depending on the scaling type, nodes might be added or removed, or node IP addresses might change. ElastiCache for Redis provides different types of connection endpoints, so choose the endpoint type that matches your application's requirements.
  • Configure the Redis client or application code to retry the query on another replica or to send the query to the primary node. If the client connects to a new replica that's still in the synchronization process, then the client receives the error LOADING: Redis is loading the dataset in memory. The time that it takes to load the dataset depends on the data size and the performance of the node. To determine whether this is an issue for your workload, test in a non-production environment.
  • Configure the cluster to scale automatically. Automatic scaling helps prevent performance issues caused by sudden increases in the incoming workload. For more information, see Auto Scaling ElastiCache for Redis clusters.
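
To find when synchronization happened, as the first bullet above suggests, you can pull SaveInProgress datapoints from CloudWatch and look for periods where the value was 1 (the metric is 1 while a background save is in progress and 0 otherwise). The following is a minimal Python sketch, assuming the datapoints have already been fetched in the shape that the CloudWatch GetMetricStatistics API returns (dicts with Timestamp and Maximum keys, requested with the Maximum statistic and a 60-second period). The function name save_windows and the sample datapoints are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def save_windows(datapoints: List[Dict], period: int = 60) -> List[Tuple[datetime, datetime]]:
    """Group SaveInProgress datapoints into (start, end) windows where Maximum >= 1.

    Assumes each datapoint has 'Timestamp' (datetime) and 'Maximum' (float) keys,
    as in a GetMetricStatistics response. CloudWatch doesn't guarantee ordering,
    so the points are sorted first. Saves shorter than one period may be missed.
    """
    points = sorted(datapoints, key=lambda p: p["Timestamp"])
    step = timedelta(seconds=period)
    windows: List[Tuple[datetime, datetime]] = []
    start = end = None
    for p in points:
        ts, busy = p["Timestamp"], p["Maximum"] >= 1
        if busy and start is not None and ts - end <= step:
            end = ts  # consecutive busy period: extend the current window
        elif busy:
            if start is not None:
                windows.append((start, end + step))  # close the previous window
            start = end = ts  # open a new window
        elif start is not None:
            windows.append((start, end + step))
            start = end = None
    if start is not None:
        windows.append((start, end + step))
    return windows


# Example with made-up one-minute datapoints (deliberately out of order).
t0 = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
sample = [
    {"Timestamp": t0 + timedelta(minutes=m), "Maximum": v}
    for m, v in [(2, 1.0), (0, 0.0), (1, 1.0), (3, 0.0), (5, 1.0)]
]
for start, end in save_windows(sample):
    print(f"save in progress from {start:%H:%M} to {end:%H:%M} UTC")
```

Each window's end is the last busy timestamp plus one period, because a datapoint's timestamp marks the start of the period it summarizes.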
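
The retry guidance in the third bullet above can be sketched as a read helper that tries replicas in order and falls back to the primary when a node answers with the LOADING error. This is a minimal Python sketch, not any specific client's API: each node is modeled as a hypothetical zero-argument callable that raises the hypothetical NodeError, whose message starts with LOADING while the node is still loading its dataset. With a real client library you would catch that library's own exception type instead (redis-py, for example, maps this reply to BusyLoadingError).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class NodeError(Exception):
    """Hypothetical error raised by a node callable; its message is the Redis error text."""


def is_loading(err: Exception) -> bool:
    # A replica that is still loading the dataset it received during
    # synchronization replies with an error that starts with "LOADING".
    return str(err).startswith("LOADING")


def read_with_fallback(replicas: Sequence[Callable[[], T]], primary: Callable[[], T]) -> T:
    """Try each replica in order; skip replicas that are loading; use the primary last."""
    for read in replicas:
        try:
            return read()
        except NodeError as err:
            if not is_loading(err):
                raise  # only LOADING is treated as "try the next node"
    # Every replica was loading (or there were none), so read from the primary.
    return primary()


# Example with stand-in node callables.
def loading_replica() -> str:
    raise NodeError("LOADING Redis is loading the dataset in memory")


def healthy_replica() -> str:
    return "value-from-replica"


def primary_node() -> str:
    return "value-from-primary"


print(read_with_fallback([loading_replica, healthy_replica], primary_node))
print(read_with_fallback([loading_replica], primary_node))
```

Sending the read to the primary keeps the request served, at the cost of extra load on the primary while the new replica finishes loading.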
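
ElastiCache automatic scaling is configured through Application Auto Scaling. The following minimal Python sketch registers a replication group's shard count as a scalable target and attaches a target-tracking policy on primary engine CPU. It rests on several assumptions: client is expected to be a boto3 application-autoscaling client (any object with the same two methods works, which is how the sketch runs below without AWS credentials); the replication group ID my-redis-cluster, the capacity limits, the 60 percent target, and the cooldowns are illustrative values; and the cluster has cluster mode turned on, which, as far as we know, ElastiCache Auto Scaling requires. Verify the supported metrics and dimensions for your engine version in Auto Scaling ElastiCache for Redis clusters.

```python
from typing import Any, Dict, List, Tuple


def enable_shard_autoscaling(client: Any, replication_group_id: str,
                             min_shards: int, max_shards: int,
                             target_cpu: float = 60.0) -> Dict[str, Any]:
    """Register the shard count as scalable and attach a target-tracking policy.

    Assumption: client = boto3.client("application-autoscaling").
    """
    target = {
        "ServiceNamespace": "elasticache",
        "ResourceId": f"replication-group/{replication_group_id}",
        # Scales the number of shards (node groups). Use
        # "elasticache:replication-group:Replicas" to scale replicas instead.
        "ScalableDimension": "elasticache:replication-group:NodeGroups",
    }
    client.register_scalable_target(MinCapacity=min_shards, MaxCapacity=max_shards, **target)
    return client.put_scaling_policy(
        PolicyName=f"{replication_group_id}-cpu-target-tracking",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ElastiCachePrimaryEngineCPUUtilization"
            },
            # Illustrative cooldowns (seconds): scale in more slowly than out.
            "ScaleInCooldown": 600,
            "ScaleOutCooldown": 300,
        },
        **target,
    )


class RecordingClient:
    """Stand-in for the boto3 client so the sketch runs offline."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def register_scalable_target(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(("register_scalable_target", kwargs))
        return {}

    def put_scaling_policy(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(("put_scaling_policy", kwargs))
        return {}


# Real use (requires boto3 and credentials):
#   enable_shard_autoscaling(boto3.client("application-autoscaling"), "my-redis-cluster", 2, 6)
fake = RecordingClient()
enable_shard_autoscaling(fake, "my-redis-cluster", min_shards=2, max_shards=6)
```

A longer scale-in cooldown than scale-out cooldown follows the same idea as scaling in gradually: remove capacity slowly and add it quickly.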

For Redis clusters that have cluster mode turned off, review these actions and check your setup and maintenance procedures:

  • For scaling in, if your applications connect only through the primary endpoint, then removing a replica node doesn't cause downtime. If your applications use the reader endpoint or individual node endpoints to connect to that replica node, then the existing connection breaks and the application must establish a new TCP connection. The application must also perform a new DNS lookup so that it doesn't connect to the removed replica node. If the client uses the reader endpoint, then downtime might occur while the DNS change for the reader endpoint propagates.
  • For scaling out, scale out during hours when the workload is minimal to avoid downtime caused by synchronization.
  • For node type changes, heavy workloads might cause synchronization to fail. Your application might also need to perform a DNS lookup on the primary or reader endpoint to establish new connections to the new node. DNS propagation takes a few seconds, so a service interruption might occur before the client reaches the new node. For Redis version 5.0.5 and later, this interruption is minimized. It's a best practice to upgrade to a newer Redis engine version to optimize ElastiCache.
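
Both the scale-in and node type change bullets above depend on the client resolving the endpoint's DNS name again when it reconnects, instead of reusing an IP address that it cached earlier. The following minimal Python sketch reconnects by host name on every attempt and retries briefly to ride out DNS propagation. The function name, attempt count, and delays are illustrative assumptions; in practice, you would usually get this behavior from your Redis client library's reconnect settings, and the demo uses a local listener in place of a real primary or reader endpoint.

```python
import socket
import time
from typing import Optional


def connect_by_name(host: str, port: int, attempts: int = 5,
                    delay: float = 0.5, timeout: float = 2.0) -> socket.socket:
    """Open a TCP connection to host:port, resolving the DNS name on every attempt.

    Passing the host name (not a cached IP) to create_connection makes each
    attempt run a fresh getaddrinfo() lookup, so a removed or replaced node's
    old address isn't reused after the DNS record changes. The OS or a local
    resolver may still cache the record for its TTL.
    """
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:  # includes DNS failures (socket.gaierror) and refusals
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))  # wait longer while DNS propagates
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error


# Demo against a local listener; in practice, host is your ElastiCache endpoint.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
conn = connect_by_name("localhost", port)
conn.close()
```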

For Redis clusters that have cluster mode turned on, review these actions and check your setup and maintenance procedures:

  • To have minimal or no downtime during scaling, see Redis cluster client discovery and exponential backoff.
  • To help minimize downtime when scaling in, see Online cluster resizing. To minimize performance issues, scale in gradually. After each scale-in step, check the cluster's performance during peak hours before you scale in further.
  • To help minimize downtime when scaling out, see Online cluster resizing.
  • When you change node types, heavy workloads might cause synchronization to fail. Also, the new nodes might not have the same IP addresses as the old nodes. To discover the new IP addresses, your application can run the CLUSTER NODES or CLUSTER SLOTS command to get updated information from the cluster. Redis clients that support Redis Cluster can update the cluster topology. To configure your Redis client, see the documentation for your specific client.
  • When you change the number of replicas, first check the performance of the primary nodes before you add replica nodes. When you reduce the number of replicas, a client that was reading from a removed replica node must send its requests to the remaining replica nodes. To stop sending requests to the removed nodes, the client must update its view of the cluster topology.
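
The exponential backoff that the first bullet in the cluster-mode list refers to can be sketched as a retry wrapper with capped, fully jittered delays, so that many clients retrying during a resize don't retry in lockstep. This is a minimal Python sketch; the base delay, cap, attempt count, and the exceptions treated as retryable (here Python's built-in ConnectionError and TimeoutError) are assumptions. With a real client, pass that client's own exception types in retry_on.

```python
import random
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.05, cap: float = 2.0) -> Iterator[float]:
    """Yield "full jitter" delays: uniform in [0, min(cap, base * 2**n)]."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** n)))


def retry_with_backoff(op: Callable[[], T], attempts: int = 6,
                       retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op(); on a retryable error, sleep a jittered, growing delay and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(next(delays))


# Demo: an operation that fails twice (as if a node were briefly unavailable
# during resharding) and then succeeds. Sleeps are recorded instead of taken.
calls = {"n": 0}


def flaky_get() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unavailable during resharding")
    return "ok"


waits = []
result = retry_with_backoff(flaky_get, sleep=waits.append)
```

Full jitter trades a slightly longer average wait for much less synchronized retry traffic against the nodes that are being resized.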
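
Cluster-aware clients keep a map from hash slots to nodes and rebuild it from CLUSTER SLOTS (or CLUSTER NODES) when the topology changes, for example after a node type change assigns new IP addresses. The following minimal Python sketch shows two pieces of that: computing a key's hash slot as the Redis Cluster specification defines it (CRC16 of the key, or of its {hash tag}, modulo 16384) and looking up the owning primary in an already-parsed CLUSTER SLOTS reply. The reply in the demo is made up, with placeholder addresses and node IDs, and the sketch ignores the extra per-node metadata that newer Redis versions can append.

```python
from typing import Sequence, Tuple

SLOT_COUNT = 16384


def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (polynomial 0x1021, initial value 0), the CRC Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honoring {hash tags} so related keys share a slot."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag replaces the key
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % SLOT_COUNT


def node_for_slot(slots_reply: Sequence[Sequence], slot: int) -> Tuple[str, int]:
    """Return (ip, port) of the primary that owns `slot` in a CLUSTER SLOTS reply.

    Each entry looks like [start, end, [ip, port, node_id, ...], replicas...];
    the first node array in an entry is the primary.
    """
    for entry in slots_reply:
        start, end, primary = entry[0], entry[1], entry[2]
        if start <= slot <= end:
            return primary[0], int(primary[1])
    raise LookupError(f"slot {slot} is not covered; refresh the cluster topology")


# Made-up CLUSTER SLOTS reply for a three-shard cluster.
example_reply = [
    [0, 5460, ["10.0.0.11", 6379, "node-a"], ["10.0.0.21", 6379, "node-d"]],
    [5461, 10922, ["10.0.0.12", 6379, "node-b"]],
    [10923, 16383, ["10.0.0.13", 6379, "node-c"]],
]
print(key_slot("123456789"), node_for_slot(example_reply, key_slot("123456789")))
```

Because node_for_slot raises when a slot isn't covered, a caller can treat that, or a MOVED reply from the server, as the signal to fetch CLUSTER SLOTS again.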
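
For the replica-count bullet, a client that spreads reads across replicas has to drop removed replicas from its rotation. One way to do that, sketched below in Python, is to keep a round-robin list of replica addresses, rebuild it from a fresh topology when a read to a replica fails, and serve that read from the primary. The ReplicaReader class and its fetch_topology and send callables are hypothetical stand-ins for your client's topology query (for example, parsing CLUSTER NODES or CLUSTER SLOTS) and command execution; this is a sketch under those assumptions, not how any particular client library works.

```python
from itertools import count
from typing import Callable, List, Tuple

Addr = Tuple[str, int]


class ReplicaReader:
    """Round-robin reads over a shard's replicas, refreshing the list on failure."""

    def __init__(self, fetch_topology: Callable[[], Tuple[Addr, List[Addr]]],
                 send: Callable[[Addr, str], str]) -> None:
        self._fetch = fetch_topology  # hypothetical: returns (primary, replicas)
        self._send = send  # hypothetical: runs one command on one node
        self._counter = count()
        self.refresh()

    def refresh(self) -> None:
        self.primary, self.replicas = self._fetch()

    def read(self, command: str) -> str:
        if self.replicas:
            node = self.replicas[next(self._counter) % len(self.replicas)]
            try:
                return self._send(node, command)
            except ConnectionError:
                # The replica may have been removed: reload the topology so later
                # reads skip it, then serve this read from the primary.
                self.refresh()
        return self._send(self.primary, command)


# Demo with a fake shard: one primary and two replicas (placeholder addresses).
live = {("10.0.0.11", 6379), ("10.0.0.21", 6379), ("10.0.0.22", 6379)}
topology = {"primary": ("10.0.0.11", 6379),
            "replicas": [("10.0.0.21", 6379), ("10.0.0.22", 6379)]}


def fake_fetch() -> Tuple[Addr, List[Addr]]:
    return topology["primary"], list(topology["replicas"])


def fake_send(node: Addr, command: str) -> str:
    if node not in live:
        raise ConnectionError(f"{node} is gone")
    return f"{command} served by {node[0]}"


reader = ReplicaReader(fake_fetch, fake_send)
# Simulate scaling in: one replica is removed from the cluster.
live.discard(("10.0.0.22", 6379))
topology["replicas"] = [("10.0.0.21", 6379)]
results = [reader.read("GET k") for _ in range(3)]
```

In the demo, the second read hits the removed replica, triggers a topology refresh, and falls back to the primary; later reads go only to the remaining replica.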

Related information

Replication: Redis (Cluster Mode Disabled) vs. Redis (Cluster Mode Enabled)

Find your node endpoints

Scaling ElastiCache for Redis

Making sure that you have enough memory to create a Redis snapshot

Best practices with Redis clients

AWS OFFICIAL. Updated 13 days ago.