How do I follow best practices for failover and recovery events for ElastiCache for Valkey or ElastiCache for Redis OSS self-designed clusters?

I want to follow best practices for failover events in my Amazon ElastiCache for Valkey or Amazon ElastiCache for Redis OSS self-designed cluster.

Short description

Failover and recovery events are essential parts of Amazon ElastiCache that allow ElastiCache to be resilient. However, when failover and recovery events occur, these events can affect the performance and availability of your application.

It's a best practice to reduce issues from failover and recovery events that affect your cluster by taking the following actions:

  • Review your events.
  • Understand the cause of the events.
  • Prepare for the events.
  • Configure event notifications.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Review your events

ElastiCache logs various events related to your cluster, security groups, and parameter groups.

Events include, but are not limited to, resource creations and deletions, scaling operations, failovers, node reboots, and snapshot creations. To better understand and analyze events in your ElastiCache cluster, review your ElastiCache events.

Example failover events in the ElastiCache event logs:

December 5, 2024, 10:12:20 Finished recovery for cache nodes 0001
December 5, 2024, 10:10:48 Recovering cache nodes 0001
December 5, 2024, 10:05:45 Recovering cache nodes 0001
December 5, 2024, 10:04:24 Failover from master node <node name> to replica node <node name> completed

Example recovery events in the ElastiCache event logs:

2022-10-05 19:20 Finished recovery for cache nodes 0001
2022-10-05 19:18 Recovering cache nodes 0001
2022-10-05 19:14 Recovering cache nodes 0001

Note: Amazon ElastiCache for Memcached doesn't support failover, but you might see similar messages in the event logs for a recovery event.
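
The review step can also be scripted. The following is a minimal sketch, assuming that you use the AWS SDK for Python (Boto3) and pass in a client created with boto3.client("elasticache"). The function name and the keyword list are illustrative assumptions. The keywords are substrings of the example messages above, so adjust them to the messages that you see in your own event log.

```python
# Substrings taken from the example event messages above (an assumption;
# extend the tuple if your event log uses other wording).
KEYWORDS = ("Failover", "Recovering cache nodes", "Finished recovery")


def failover_and_recovery_events(client, duration_minutes=1440):
    """Return (date, source, message) tuples for failover and recovery events.

    `client` is assumed to be a Boto3 ElastiCache client. Any object with a
    compatible describe_events method also works, which helps with testing.
    """
    matches = []
    kwargs = {"Duration": duration_minutes, "MaxRecords": 100}
    while True:
        page = client.describe_events(**kwargs)
        for event in page.get("Events", []):
            message = event.get("Message", "")
            if any(keyword in message for keyword in KEYWORDS):
                matches.append(
                    (event["Date"], event.get("SourceIdentifier"), message)
                )
        marker = page.get("Marker")
        if not marker:
            break
        kwargs["Marker"] = marker  # request the next page of results
    # Oldest first, so that a failover appears before the recovery after it.
    return sorted(matches, key=lambda item: item[0])
```

To use the sketch, call failover_and_recovery_events(boto3.client("elasticache")) and print each tuple. The Duration parameter is in minutes, so the default covers the last 24 hours.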

Understand the cause of the events

During a failover event, ElastiCache replaces an unavailable primary node with a replica node. ElastiCache also replaces primary nodes for user requested actions or planned events. For more information, see Amazon ElastiCache FAQs.

Example events:

  • To test failover functionality
  • To perform planned maintenance
  • To resolve Availability Zone issues

If a replica node experiences availability issues, then ElastiCache replaces the replica with a new replica node.

Note: This replacement doesn't start a failover event.

When ElastiCache attempts to restore the cluster in these situations, ElastiCache logs these recovery events.

Note: To determine if a node is primary or not, use the IsMaster Amazon CloudWatch metric. For more information, see Metrics for Valkey and Redis OSS.
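
As an illustration of the IsMaster check, the following sketch reads the metric from CloudWatch and reports when a node's role changed. It assumes a Boto3 CloudWatch client created with boto3.client("cloudwatch") and the AWS/ElastiCache namespace with the CacheClusterId and CacheNodeId dimensions. The function names and the default node ID "0001" are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def role_changes(datapoints):
    """Return (timestamp, new_role) for each change in the IsMaster value.

    Each datapoint is a dict with "Timestamp" and "Maximum" keys, which is
    the shape that get_metric_statistics returns for Statistics=["Maximum"].
    """
    changes = []
    previous = None
    # CloudWatch doesn't guarantee datapoint order, so sort by time first.
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        is_primary = point["Maximum"] >= 1
        if previous is not None and is_primary != previous:
            changes.append(
                (point["Timestamp"], "primary" if is_primary else "replica")
            )
        previous = is_primary
    return changes


def fetch_is_master(cloudwatch, cache_cluster_id, hours=12, cache_node_id="0001"):
    """Fetch one-minute IsMaster datapoints for a single node.

    `cloudwatch` is assumed to be boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="IsMaster",
        Dimensions=[
            {"Name": "CacheClusterId", "Value": cache_cluster_id},
            {"Name": "CacheNodeId", "Value": cache_node_id},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    return response["Datapoints"]
```

A change to "replica" on one node at about the same time as a change to "primary" on another node is consistent with a failover between those two nodes.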

Unplanned failover and recovery events

In ElastiCache, an unplanned failover occurs when the primary node fails unexpectedly and prompts the service to promote a replica node to the primary role. Similarly, when a replica node fails, ElastiCache automatically provisions a new replica node. Both processes minimize downtime and maintain high availability. The following are common causes of unplanned failover and replacement:

  • Underlying host issues: When a hardware failure, a networking issue, or an Availability Zone failure affects the ElastiCache host, ElastiCache performs a recovery. In the rare event of an AWS infrastructure failure, automated processes maintain the high availability of the cluster.
  • Heavy workload: Amazon ElastiCache for Redis OSS and Amazon ElastiCache for Valkey run commands on a single thread, so a long-running command blocks other operations. Excessive workload can over-utilize and exhaust the cluster's resources, which can lead to failover and recovery. For example, complex commands, inefficient Lua scripts, and large key-based operations can overwhelm the cluster and degrade performance.

Note: When a primary or replica node fails because of a temporary Availability Zone disruption, ElastiCache launches the new replica after the Availability Zone recovers.

Planned failover and recovery events

Planned failover and recovery events can occur for scheduled maintenance or user-initiated operations.

For scheduled maintenance, AWS regularly upgrades the ElastiCache fleet to strengthen the security, reliability, and operational performance of ElastiCache clusters. Scheduled maintenance events, such as for node replacements and service updates as part of continuous managed maintenance, can start failover and recovery events. For more information, see Amazon ElastiCache managed maintenance and service updates help page.

For user-initiated operations, you can start a test failover through the TestFailover API, the test-failover AWS CLI command, or the ElastiCache console. To promote a read replica to primary in a cluster mode disabled cluster, initiate a promote operation. For more information, see Promoting a read replica to primary, for Valkey or Redis OSS (cluster mode disabled) replication groups.

Note: In some conditions, such as during large-scale operational events, AWS might block this API. If AWS blocks the API, then you see the following message in your event logs: "Test Failover API called for node group 0001."

Prepare for events

For planned failover events, such as for maintenance or service updates, ElastiCache replaces the nodes while the cluster serves incoming write requests. To mitigate issues, follow best practices for planned failover events. For more information, see Amazon ElastiCache managed maintenance and service updates help page.

For unplanned failover events, ElastiCache fails over automatically when Multi-AZ is turned on for your cluster.

Note: If ElastiCache replaces a replica while you read from that node through the replica endpoint, then the node might be unavailable. After the replacement completes, the node becomes available for read requests.

To reduce issues during planned and unplanned events, follow connectivity and configuration best practices.
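
One common connectivity practice is to retry failed calls with exponential backoff and jitter, so that clients reconnect gradually after a failover instead of all at once. The following is a minimal sketch of that technique that uses only the Python standard library. The function name, the default delays, and the retried exception types are assumptions. Many Valkey and Redis OSS client libraries raise their own exception classes, so pass those classes in retry_on.

```python
import random
import time


def call_with_backoff(operation, retry_on=(ConnectionError, TimeoutError),
                      attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Run operation(), retrying transient connection errors with backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients don't reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

For example, call_with_backoff(lambda: client.get("key")) wraps a single read, where client is a hypothetical cache client object. The sleep parameter exists so that tests can replace the real delay.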

Configure event notifications

To quickly respond to events and their causes, configure ElastiCache to send notifications when there's a failover or recovery in a cluster. For more information, see Managing ElastiCache Amazon Simple Notification Service (Amazon SNS) notifications.
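
As a rough sketch of that configuration, the following function attaches an existing Amazon SNS topic to a replication group. It assumes a Boto3 ElastiCache client created with boto3.client("elasticache"). The function name is illustrative, and the replication group ID and topic ARN are values that you supply.

```python
def enable_notifications(client, replication_group_id, topic_arn):
    """Attach an SNS topic to a replication group and activate notifications.

    `client` is assumed to be boto3.client("elasticache"), and `topic_arn`
    is assumed to identify an SNS topic that already exists.
    """
    response = client.modify_replication_group(
        ReplicationGroupId=replication_group_id,
        NotificationTopicArn=topic_arn,
        NotificationTopicStatus="active",
        ApplyImmediately=True,
    )
    return response["ReplicationGroup"]["Status"]
```

Subscribe an endpoint, such as an email address, to the topic so that the notifications reach you.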

When you configure ElastiCache to use Amazon SNS for notifications, you receive notifications similar to the following examples:

Example recovery events:

Recovery reason : Recovery completed for node as ElastiCache monitoring detected a network reachability failure on the node, ElastiCache:CacheNodeReplaceComplete : <node>
Recovery reason : Recovery completed for node as ElastiCache monitoring detected software issues on the node, ElastiCache:CacheNodeReplaceComplete : <node>
Recovery reason : Recovery completed for node as ElastiCache monitoring detected unresponsive engine on the node, ElastiCache:CacheNodeReplaceComplete : <node>
Recovery reason : Recovery completed for node as ElastiCache monitoring detected busy and unresponsive engine on the node, ElastiCache:CacheNodeReplaceComplete : <node>

Example failover events:

Failover reason : Failover completed for node as ElastiCache monitoring detected a network reachability failure on the node, ElastiCache:FailoverComplete : <node>
Failover reason : Failover completed for node as ElastiCache monitoring detected software issues on the node, ElastiCache:FailoverComplete : <node>
Failover reason : Failover completed for node as ElastiCache monitoring detected unresponsive engine on the node, ElastiCache:FailoverComplete : <node>
Failover reason : Failover completed for node as ElastiCache monitoring detected busy and unresponsive engine on the node, ElastiCache:FailoverComplete : <node>

Note: ElastiCache for Memcached doesn't support enhanced messages for recovery events.
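
If you route these notifications to a script, then it can help to split each message into its parts. The following sketch assumes that the message text has the same layout as the examples above. The pattern and the function name are illustrative, and the message that Amazon SNS delivers might be wrapped in additional fields, so verify the layout against a notification from your own cluster.

```python
import re

# Layout assumed from the examples above:
# "<kind> reason : <reason>, ElastiCache:<EventName> : <node>"
PATTERN = re.compile(
    r"(?P<kind>Recovery|Failover) reason\s*:\s*(?P<reason>.+?),\s*"
    r"(?P<event>ElastiCache:\w+)\s*:\s*(?P<node>\S+)"
)


def parse_notification(text):
    """Split an enhanced notification message into its parts, or return None."""
    flattened = " ".join(text.split())  # undo any line wrapping in the message
    match = PATTERN.search(flattened)
    return match.groupdict() if match else None
```

The returned dictionary has the keys kind, reason, event, and node. A reason that mentions a busy engine points to workload, and a reason that mentions network reachability points to the underlying host or network.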

Related information

Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch

How do I troubleshoot high latency issues in ElastiCache for Redis?

How do I troubleshoot increased CPU usage in my ElastiCache for Redis self-designed cluster?

AWS OFFICIAL. Updated 5 days ago