
[ElastiCache] Expected reader endpoint behaviour on sole replica being promoted to primary

1

I am trying to find reliable information on the expected behaviour in the following scenario. Setup:

  • ElastiCache for Valkey - cluster mode disabled - 2 nodes (1 primary, 1 replica).
  • My use case requires 2 deployments of an app. One uses the primary endpoint to read and write to the cache. For the second identical deployment of the app, I want to limit it to cache reads only. So I use it with the reader endpoint.

Scenario: The primary instance goes down. The only replica is promoted to primary. I read that AWS will spin up a new replica to replace the downed instance, but this may take a few minutes (5-10+ mins). I understand that the first app deployment uses the primary endpoint and will continue to function without issues once the failover is successful. But for the second app, which uses the reader endpoint, which of the following will be true while AWS spins up a new replica and gets things back to normal?

  • Will the reader endpoint become unavailable until a new replica comes up and is ready to serve read-only operations?
  • Will the reader endpoint point to the new primary instance and serve traffic, but still only allow read operations?
  • Will the reader endpoint point to the new primary instance and serve traffic, but allow read and write operations?
asked 16 days ago, 68 views
2 Answers
4
Accepted Answer

I can confirm that the AI-generated response is technically accurate regarding ElastiCache (Valkey/Redis) behavior.

To provide more technical context on why this happens:

  1. DNS Routing: The Reader Endpoint is essentially a DNS CNAME that resolves to the IP addresses of the available replicas. When the replica count drops to zero during a failover, ElastiCache's control plane updates the DNS record to point to the Primary node to ensure high availability for reads, rather than returning an empty set or failing.

  2. Lack of Protocol Enforcement: ElastiCache endpoints do not 'intercept' commands. They only route traffic. Since the Primary node is configured to accept writes, any application connecting via the Reader Endpoint (which currently points to the Primary) will successfully execute write commands if it attempts them.

  3. The "Valkey/Redis" behavior: In Cluster Mode Disabled, nodes don't automatically enforce a READONLY state on the primary node just because a connection came through a specific endpoint.
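One way to check which kind of node an endpoint currently lands on is to ask the node itself: the reply to INFO replication contains a role field, reported as master on a primary and slave on a replica. The sketch below only parses that reply. The function name and sample text are illustrative, and fetching the reply (for example with a Valkey/Redis client connected to the reader endpoint) is left out.

```python
def parse_role(info_replication: str) -> str:
    """Return 'primary', 'replica', or 'unknown' from raw INFO replication text."""
    for line in info_replication.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            role = line.split(":", 1)[1].strip().lower()
            if role == "master":
                return "primary"
            # Accept both spellings defensively.
            if role in ("slave", "replica"):
                return "replica"
            return "unknown"
    return "unknown"


# Illustrative reply, as it might look right after the sole replica was promoted.
sample = "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n"
print(parse_role(sample))  # prints: primary
```

If the reader endpoint reports a primary role, the read-only deployment is talking to a node that will accept writes.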

My recommendation for your use case:

If it is mission-critical that your second app never writes to the cache, you should:

  • Application Level: Use a read-only client configuration or a specific database user with restricted permissions (if using RBAC).
  • RBAC (Recommended): If you are using Valkey, or Redis OSS 6.0 or later, create an ElastiCache user (password- or IAM-authenticated) with an access string like on ~* &* -@all +@read and use these credentials in your 'read-only' app deployment. Because the node itself rejects write commands for that user, this enforces read-only behavior regardless of which endpoint is used.
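As a rough sketch of the RBAC option, the helper below assembles the parameters for an ElastiCache user that may only run read-category commands. It assumes boto3's ElastiCache create_user call and uses the access string on ~* &* -@all +@read, which activates the user and clears all permissions before granting the read category. The function names and any user ID or user name passed in are hypothetical placeholders.

```python
# Access string: user enabled, all keys and channels, only @read commands.
READ_ONLY_ACCESS = "on ~* &* -@all +@read"


def readonly_user_params(user_id: str, user_name: str, password: str,
                         engine: str = "valkey") -> dict:
    """Build keyword arguments for ElastiCache create_user (read-only user).

    Assumption: engine is "valkey"; pass "redis" for Redis OSS.
    """
    return {
        "UserId": user_id,
        "UserName": user_name,
        "Engine": engine,
        "AccessString": READ_ONLY_ACCESS,
        "Passwords": [password],
    }


def create_readonly_user(client, user_id: str, user_name: str,
                         password: str, engine: str = "valkey"):
    """Create the user; client is assumed to be boto3.client("elasticache")."""
    params = readonly_user_params(user_id, user_name, password, engine)
    return client.create_user(**params)
```

The new user still has to be added to a user group that is attached to the replication group before the app can authenticate with it. Some client libraries also issue connection commands such as PING or SELECT, so the access string may need +@connection as well.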
EXPERT
answered 16 days ago
  • Thank you for the explanation Florian.

1

When your sole replica is promoted to primary during a failover, the reader endpoint will continue to point to the newly promoted primary node until a replacement replica is spun up and becomes available. During this interim period, the reader endpoint will allow both read AND write operations because it's pointing to a primary node.

This is an important consideration for your architecture. The reader endpoint doesn't enforce read-only behavior at the endpoint level - it simply routes to available replica nodes. When no replicas exist (as in your scenario after failover but before the replacement replica is ready), the reader endpoint will route to the primary node with full read-write capabilities.

For your second application deployment that should be limited to read-only operations, you'll need to implement read-only enforcement at the application level rather than relying solely on the reader endpoint. The reader endpoint provides routing convenience but doesn't provide access control restrictions.
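A minimal sketch of such application-level enforcement is a thin wrapper that forwards only an allowlist of read commands and rejects everything else before it reaches the server. The class name, the allowlist, and the execute_command interface of the wrapped client (as exposed by redis-py style clients) are assumptions for illustration.

```python
class WriteAttemptError(RuntimeError):
    """Raised when a command outside the read-only allowlist is attempted."""


# Illustrative allowlist; extend it with the read commands the app really uses.
READ_COMMANDS = frozenset({
    "GET", "MGET", "EXISTS", "TTL", "HGET", "HGETALL",
    "SMEMBERS", "LRANGE", "ZRANGE", "SCAN", "PING",
})


class ReadOnlyClient:
    """Wrap a client object and block commands that are not allowlisted."""

    def __init__(self, client, allowed=READ_COMMANDS):
        self._client = client
        self._allowed = frozenset(name.upper() for name in allowed)

    def execute_command(self, name, *args):
        if name.upper() not in self._allowed:
            raise WriteAttemptError(
                f"command {name!r} is not on the read-only allowlist"
            )
        return self._client.execute_command(name, *args)

    def get(self, key):
        # Convenience method routed through the same check.
        return self.execute_command("GET", key)
```

A guard like this only protects code paths that go through the wrapper, so a server-side ACL remains the stronger control. The two can be combined.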

Once ElastiCache spins up the new replica in the failed primary's availability zone and it syncs with the newly promoted primary (which typically takes several minutes), the reader endpoint will then point back to this new replica node for read operations.
Sources
Resilience in Amazon ElastiCache - Amazon ElastiCache
Minimizing downtime during maintenance - Amazon ElastiCache
Resilience testing on Amazon ElastiCache with AWS Fault Injection Service | AWS Database Blog

answered 16 days ago
EXPERT
reviewed 16 days ago
EXPERT
reviewed 16 days ago
  • @Riku_Kobayashi san, may I know if you fully agree with this AI-generated answer?
