AWS ElastiCache Serverless `Wait` Command and Durability

The `WAIT` command is unsupported in EC Serverless. I understand that EC Serverless has a 99.99% SLA, is designed to appear to a connected client as a Redis/Valkey cluster, and hides the complexity of managing a cluster behind proxy nodes. I'm a bit confused by this claim in the docs:

Serverless caches automatically mitigate node failures with a Multi-AZ architecture so that node failures are transparent to your application.

But then the docs also state this:

[Multi-AZ Serverless] Data is replicated asynchronously across multiple Availability Zones for higher availability and improved read latency.

EC Serverless is a cache with the eviction policy set to volatile-lru, so I understand that it shouldn't be used for data that requires durability. My use case, however, involves using Redisson for distributed locks. Since I can't use the `WAIT` command, am I correct in assuming that EC Serverless replicates synchronously from the primary to read replicas in the same AZ, and asynchronously to read replicas in other AZs? The claim that node failures are transparent to my application seems to imply that a primary node failure causes no data loss, which would require synchronous replication behind the scenes. My concern is that, without `WAIT`, a Redisson RLock isn't durable: a lock acknowledged by the primary could be lost if that primary fails. Relevant code here.

By the way, this re:Post answer by the bot claims that EC Serverless persists data to disk, but that doesn't seem correct: the docs make no mention of disk persistence for EC Serverless, apart from backups.

asked a month ago, 35 views
1 Answer

You've raised some important points about Amazon ElastiCache Serverless and its replication behavior. Let's clarify a few things:

  1. ElastiCache Serverless does indeed use asynchronous replication across multiple Availability Zones (AZs) for higher availability and improved read latency. This means that there is a possibility of data loss in the event of a primary node failure, as the most recent writes may not have been replicated to all AZs yet.

  2. The claim that node failures are transparent to your application refers to the high availability and automatic failover capabilities of ElastiCache Serverless. While it aims to minimize disruption, it doesn't guarantee zero data loss in all scenarios.

  3. Regarding synchronous vs. asynchronous replication within and across AZs, ElastiCache Serverless documentation doesn't explicitly state that it uses synchronous replication within the same AZ. The service is designed to abstract these details from users, focusing on providing high availability and scalability.

  4. For your use case with Redisson for distributed locks, the lack of support for the `WAIT` command in ElastiCache Serverless does affect the durability of locks. Because the client cannot block until a write has been acknowledged by replicas, a lock that the primary has acknowledged might not have reached any replica before a failover occurs.

  5. Your concern about Redisson RLock not being a durable lock in this context is valid. With asynchronous replication and no `WAIT` command, there is no guarantee that a newly acquired lock exists on a replica, so after a failover a second client could acquire a lock that the first client still believes it holds.

  6. Regarding data persistence, ElastiCache Serverless is primarily an in-memory cache and doesn't persist data to disk as part of its normal operation. The service does support backups, but this is different from continuous disk persistence. The claim about ElastiCache Serverless persisting data to disk at rest (outside of backups) appears to be incorrect.
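The failure mode described in points 4 and 5 can be illustrated with a toy simulation. This is only a sketch of asynchronous replication in general, not a model of ElastiCache Serverless internals, which AWS does not document at this level. The `Node` and `ToyCache` classes, the `set_nx` helper, and the key name `lock:order-42` are all hypothetical names made up for the illustration.

```python
class Node:
    """A toy in-memory node holding lock keys (not a real Redis/Valkey node)."""

    def __init__(self):
        self.data = {}


class ToyCache:
    """Toy primary/replica pair with asynchronous replication (hypothetical model)."""

    def __init__(self):
        self.primary = Node()
        self.replica = Node()
        self.backlog = []  # writes acknowledged to clients but not yet replicated

    def set_nx(self, key, owner):
        """Emulate SET key owner NX: succeed only if the key is absent on the primary."""
        if key in self.primary.data:
            return False
        self.primary.data[key] = owner
        self.backlog.append((key, owner))  # replication happens later
        return True  # the client sees success before the replica has the write

    def replicate(self):
        """Deliver all pending writes to the replica."""
        for key, owner in self.backlog:
            self.replica.data[key] = owner
        self.backlog.clear()

    def fail_over(self):
        """The primary dies; the replica is promoted with whatever it has received."""
        self.primary = self.replica
        self.replica = Node()
        self.backlog.clear()


cache = ToyCache()
a_acquired = cache.set_nx("lock:order-42", "client-A")
cache.fail_over()  # the primary fails before the write was replicated
b_acquired = cache.set_nx("lock:order-42", "client-B")
print(a_acquired, b_acquired)  # both True: two clients believe they hold the lock
```

If `cache.replicate()` runs before `cache.fail_over()`, the second `set_nx` call returns `False`, which is the outcome a lock holder actually wants. The problem is that the client has no way to know whether replication happened in time.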

Given these considerations, if your application requires strong consistency guarantees for distributed locking, you might need to explore alternative solutions or consider using a node-based (self-designed) ElastiCache cluster, where you have more control over replication settings and can use the `WAIT` command.
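Where `WAIT` is available, a lock client can treat the lock as acquired only after a replica has acknowledged the write. The sketch below shows that pattern against a toy in-memory model. `ToyCache`, `acquire`, and `min_replicas` are hypothetical names, and the toy `wait` omits the timeout argument that the real command takes. Note also that the Redis documentation states that `WAIT` does not make Redis a strongly consistent store; it narrows the window for losing a write but does not close it.

```python
class ToyCache:
    """Toy primary with one replica and a WAIT-like call (hypothetical model)."""

    def __init__(self, replica_reachable=True):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet delivered to the replica
        self.replica_reachable = replica_reachable

    def set_nx(self, key, owner):
        """Emulate SET key owner NX on the primary."""
        if key in self.primary:
            return False
        self.primary[key] = owner
        self.pending.append((key, owner))
        return True

    def wait(self, num_replicas):
        """Emulate WAIT: return how many replicas acknowledged the pending writes."""
        if not self.replica_reachable:
            return 0
        for key, owner in self.pending:
            self.replica[key] = owner
        self.pending.clear()
        return 1

    def delete(self, key):
        """Remove a key from the primary and drop its pending replication."""
        self.primary.pop(key, None)
        self.pending = [(k, v) for k, v in self.pending if k != key]


def acquire(cache, key, owner, min_replicas=1):
    """Acquire the lock only if enough replicas confirm they received the write."""
    if not cache.set_nx(key, owner):
        return False
    if cache.wait(min_replicas) < min_replicas:
        cache.delete(key)  # replication could not be confirmed; give the lock up
        return False
    return True


healthy = ToyCache(replica_reachable=True)
degraded = ToyCache(replica_reachable=False)
healthy_result = acquire(healthy, "lock:order-42", "client-A")
degraded_result = acquire(degraded, "lock:order-42", "client-A")
```

The design choice here is to fail closed: when replication cannot be confirmed, the client releases the key and reports failure, so it never works under a lock that exists only on the primary.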

Remember that ElastiCache Serverless is designed for ease of use and automatic scaling, which comes with some trade-offs in terms of fine-grained control over replication and consistency. For use cases requiring strict consistency guarantees, especially for distributed locking, you may need to carefully evaluate whether the service meets your specific requirements.
Sources
Choosing between deployment options - Amazon ElastiCache
Valkey-, Memcached-, and Redis OSS-Compatible Cache – Amazon ElastiCache FAQs - AWS

answered a month ago
  • Decent answer from the bot, but it would be nice to have an AWS Expert give a bit more context on what happens inside an EC Serverless cluster.
