How do I minimize downtime in ElastiCache during failover?

I want to follow best practices to minimize downtime during failovers for Amazon ElastiCache for Redis OSS and Amazon ElastiCache for Valkey.

Short description

ElastiCache can experience failovers that affect application performance and reliability for the following reasons:

  • If you deplete resources, such as memory, CPU, or network bandwidth, then your nodes might fail over.
  • If AWS schedules a maintenance event for an update, then your nodes might fail over.
  • If the physical hardware that hosts the nodes fails or has issues, then your nodes might fail over.
  • If you don't correctly configure the applications or services that interact with your cache, then a failover can cause longer downtime.

Resolution

Turn on Multi-AZ

To create and maintain primary and replica nodes across different Availability Zones (AZs) in an AWS Region, use the ElastiCache Multi-AZ feature. If the primary node fails, then ElastiCache promotes a replica node to primary with minimal downtime.
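As a minimal sketch, you can turn on Multi-AZ together with automatic failover for an existing replication group through the AWS SDK for Python (Boto3). The helper names and the group ID my-redis-group are hypothetical. The modify_replication_group call and its parameters come from the Boto3 ElastiCache client.

```python
def multi_az_request(replication_group_id, apply_immediately=True):
    """Build the parameters that turn on Multi-AZ and automatic failover."""
    return {
        "ReplicationGroupId": replication_group_id,
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
        "ApplyImmediately": apply_immediately,
    }


def enable_multi_az(client, replication_group_id):
    """Submit the change. `client` is assumed to be boto3.client("elasticache")."""
    return client.modify_replication_group(**multi_az_request(replication_group_id))
```

With AWS credentials configured, a call such as enable_multi_az(boto3.client("elasticache"), "my-redis-group") submits the change. The sketch assumes that the replication group already has at least one replica to promote.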

Add read replicas

When you add read replicas to your Redis deployments, you reduce downtime and the risk of data loss during failover. Route read requests to the read replicas so that the primary node mainly handles write operations. This configuration provides the following benefits:

  • Improves read throughput
  • Reduces latency
  • Provides fault tolerance
  • Reduces the downtime that maintenance tasks cause for your cluster

Distribute nodes across Availability Zones

When you distribute nodes across multiple Availability Zones, the replicas in different AZs provide high availability and continuous read operations. This configuration enhances system resilience and reduces downtime in case of node failover. You can distribute your nodes across multiple AZs when you first configure your cluster or when you add new nodes to an existing cluster. For more information, see Choosing regions and availability zones for ElastiCache.
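To check how an existing replication group is spread across AZs, you can inspect the output of the Boto3 describe_replication_groups call. The following sketch assumes the NodeGroups, NodeGroupMembers, and PreferredAvailabilityZone fields of that response. The function names are hypothetical.

```python
def zones_per_shard(replication_group):
    """Map each node group (shard) ID to the sorted AZs of its member nodes."""
    spread = {}
    for node_group in replication_group.get("NodeGroups", []):
        zones = {
            member["PreferredAvailabilityZone"]
            for member in node_group.get("NodeGroupMembers", [])
            if "PreferredAvailabilityZone" in member
        }
        spread[node_group["NodeGroupId"]] = sorted(zones)
    return spread


def single_az_shards(replication_group):
    """Return the IDs of shards whose nodes all sit in fewer than two AZs."""
    return [
        shard_id
        for shard_id, zones in zones_per_shard(replication_group).items()
        if len(zones) < 2
    ]
```

You can pass in one entry from describe_replication_groups(ReplicationGroupId="my-redis-group")["ReplicationGroups"]. Any shard ID that single_az_shards returns is a candidate for a replica in another AZ.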

Use the latest Valkey or Redis OSS version

Based on your cluster and node type, use the latest version of Valkey or Redis OSS to support the latest features. For example, cluster mode disabled clusters require Valkey version 7.2 or Redis OSS version 5.0.6 or later to use the planned node replacements feature. For more information, see Supported node types.

Monitor cluster events

To identify and respond to failovers, review your ElastiCache cluster events. To detect failovers early, use Amazon Simple Notification Service (Amazon SNS) to configure ElastiCache to send notifications for important cluster events.
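The following sketch shows one way to pull failover-related events for a replication group with the Boto3 describe_events call. The sample messages are illustrative placeholders, not exact ElastiCache event text. The function names and the group ID are hypothetical, and pagination through the response Marker is left out for brevity.

```python
from datetime import datetime, timezone

# Hypothetical events in the shape of describe_events()["Events"].
SAMPLE_EVENTS = [
    {"SourceIdentifier": "my-redis-group", "Message": "Failover to replica node completed",
     "Date": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)},
    {"SourceIdentifier": "my-redis-group", "Message": "Cache cluster patched",
     "Date": datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)},
    {"SourceIdentifier": "my-redis-group", "Message": "Test Failover API called",
     "Date": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)},
]


def failover_events(events):
    """Return the events whose message mentions a failover, oldest first."""
    matches = [e for e in events if "failover" in e.get("Message", "").lower()]
    return sorted(matches, key=lambda e: e["Date"])


def recent_failovers(client, replication_group_id, minutes=1440):
    """Fetch recent events for one replication group and keep failover ones.

    `client` is assumed to be boto3.client("elasticache").
    """
    response = client.describe_events(
        SourceIdentifier=replication_group_id,
        SourceType="replication-group",
        Duration=minutes,  # look-back window in minutes
    )
    return failover_events(response.get("Events", []))
```

For push notifications instead of polling, the NotificationTopicArn parameter of modify_replication_group attaches an Amazon SNS topic to the replication group.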

Use the correct endpoints

To minimize downtime during failover, use the endpoints that match your ElastiCache cluster configuration. For cluster mode disabled clusters, use the primary endpoint for write operations and the reader endpoint for read operations so that read workloads are distributed across replicas. For cluster mode enabled clusters, use the configuration endpoint for all operations so that your client automatically connects to the correct nodes. When you choose the correct endpoint for your cluster mode, you optimize performance and create a smooth failover process. For more information, see Finding connection endpoints in ElastiCache.

Note: It's not a best practice to directly use individual node endpoints. Node roles can change during failover events, so an application that connects to an individual node endpoint might keep sending writes to a node that's no longer the primary. Instead, use the correct endpoints for your connection type.
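The endpoint rules above can be sketched as a small selection helper. The Endpoints class, the endpoint_for function, and the hostnames in the usage note are hypothetical. Replace them with the endpoints that ElastiCache reports for your cluster.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoints:
    """Endpoint hostnames for one cluster; unused ones stay None."""
    primary: Optional[str] = None        # cluster mode disabled: writes
    reader: Optional[str] = None         # cluster mode disabled: reads
    configuration: Optional[str] = None  # cluster mode enabled: all operations


def endpoint_for(endpoints, cluster_mode_enabled, read_only=False):
    """Pick the hostname to connect to for the given kind of operation."""
    if cluster_mode_enabled:
        if not endpoints.configuration:
            raise ValueError("cluster mode enabled requires a configuration endpoint")
        return endpoints.configuration
    if read_only and endpoints.reader:
        return endpoints.reader
    if not endpoints.primary:
        raise ValueError("cluster mode disabled requires a primary endpoint")
    return endpoints.primary
```

For example, endpoint_for(Endpoints(primary="primary.cache.example", reader="reader.cache.example"), cluster_mode_enabled=False, read_only=True) selects the reader hostname. With the redis-py client library, you would typically pass the result to redis.Redis(host=...) for cluster mode disabled clusters or to redis.cluster.RedisCluster(host=...) for cluster mode enabled clusters.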

Regularly test automatic failover

To maintain reliable Redis deployments, it's a best practice to regularly test automatic failover. To test automatic failover, you must simulate primary node failure to make sure that your replicas are promoted to primary status. These tests can identify issues in your configurations and allow you to address the issues before they affect your clusters. Additionally, failover tests provide insights on application performance and how to optimize your architecture and recovery procedures.

Follow best practices for your Redis clients

For your Redis clients, follow these best practices:

  • To enhance application performance and scalability, use connection pooling to manage reusable, pre-established connections. For more information, see Connection pools and multiplexing on the Redis website.
  • Implement exception handling and timeouts so that your applications stay responsive when a node is slow or unavailable. You can also review your logs for timeouts to identify issues and adjust your configurations. For more information, see Client timeouts on the Redis website.
  • To maintain resilient applications, implement retry mechanisms that use an exponential backoff strategy. Configure the mechanisms to differentiate between transient errors that warrant retries, and permanent failures that don't warrant retries. For more information, see Cluster client discovery and exponential backoff (Valkey and Redis OSS).
  • Turn on logs to capture key metrics and errors, and establish a performance baseline for your cluster. For more information, see Logging events on the Redis website.
  • Design your clients to dynamically handle cluster topology changes as well as adapt to node and role changes. To maintain connections to cluster nodes and optimize your clusters, implement smart connection pooling. For more information, see Redis cluster and client libraries on the Redis website.
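The retry guidance above can be sketched as a small wrapper that uses exponential backoff with full jitter. This is a minimal illustration, not a specific client library's retry API. The function names are hypothetical, and Python's built-in ConnectionError and TimeoutError stand in for transient errors. With redis-py, you would likely list redis.exceptions.ConnectionError and redis.exceptions.TimeoutError instead.

```python
import random
import time

# Errors treated as transient (assumption: adjust for your client library).
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def backoff_delays(count, base=0.1, cap=5.0):
    """Return `count` exponentially growing delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(count)]


def call_with_retry(operation, attempts=5, base=0.1, cap=5.0,
                    sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying transient errors with jittered backoff.

    Any other exception is treated as permanent and propagates immediately.
    """
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last transient error
            # Full jitter: wait a random fraction of the current delay.
            sleep(rng() * delays[attempt])
```

For example, call_with_retry(lambda: client.get("key")) retries a read while a failover is in progress. The jitter spreads out reconnect attempts so that many clients don't retry against the new primary at the same moment.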

Related information

Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch

How do I troubleshoot high latency issues in ElastiCache for Redis?

Supported connection clients on the Redis website

AWS OFFICIAL
Updated 10 days ago