
How to configure the Redis client (Lettuce) to handle primary/replica failover?


We have an ElastiCache cluster (5 shards, each with a primary and a replica) with auto-failover enabled. Yesterday a failover occurred and one replica was promoted to primary. However, our Redis client (Lettuce) did not discover the new topology, and many PUT and DEL operations (possibly all of them for that one shard) failed with: Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)

How should we configure the client to handle this situation? Is enabling

ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(Duration.ofSeconds(30))
    .enableAllAdaptiveRefreshTriggers()
    .build()

enough?

To connect, we use a name configured as a DNS CNAME record in Route 53 that points to the cluster URL (the configuration endpoint from the Redis configuration).

Asked 3 years ago · 1.3K views
1 answer

Hello,

A successful failover flags the malfunctioning node as failing and promotes the replica to primary. Clients are supposed to refresh the cluster topology at periodic intervals (enablePeriodicRefresh) and in response to specific events (enableAllAdaptiveRefreshTriggers). Your configuration looks good in these respects.

The ElastiCache cluster endpoint returns all nodes in the cluster (10 nodes in your case), and any healthy node can be used to refresh the client topology. I am not an authority on Lettuce, but based on your description, it looks like the client kept trying to contact the failing node instead of refreshing the topology and routing requests to the healthy nodes.

You may want to set lower timeouts in your socket options and make sure that dynamicRefreshSources is true.
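As a rough illustration of the settings mentioned above, the following sketch combines topology refresh, socket options, and a command timeout in one Lettuce cluster client. It is not a verified fix for the failover problem: the host name is a hypothetical placeholder, the timeout values are illustrative assumptions rather than recommendations, and the class name FailoverAwareClient is made up for the example. It assumes a Lettuce 5.x or 6.x dependency on the classpath.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

import java.time.Duration;

public class FailoverAwareClient {
    public static void main(String[] args) {
        // Hypothetical endpoint: replace with your ElastiCache configuration endpoint.
        RedisURI uri = RedisURI.create("my-cluster.example.cache.amazonaws.com", 6379);

        // Refresh the topology on a timer and on cluster events (MOVED, reconnects, ...).
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                // Use all discovered nodes, not only the seed, as refresh sources.
                .dynamicRefreshSources(true)
                .build();

        // Illustrative value: fail fast when a node stops accepting connections.
        SocketOptions socket = SocketOptions.builder()
                .connectTimeout(Duration.ofSeconds(1))
                .keepAlive(true)
                .build();

        ClusterClientOptions options = ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .socketOptions(socket)
                // Illustrative value: matches the 2-second command timeout in the question.
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(2)))
                .build();

        RedisClusterClient client = RedisClusterClient.create(uri);
        client.setOptions(options);

        try (StatefulRedisClusterConnection<String, String> conn = client.connect()) {
            conn.sync().set("key", "value");
        } finally {
            client.shutdown();
        }
    }
}
```

Running this requires a reachable cluster, so treat it as a starting point for experimentation and test it against a manually triggered failover before relying on it.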

Additionally, make sure that you are running an up-to-date Lettuce version. I've found bug reports describing problems with cluster topology refresh in fairly recent versions.

Maybe not directly related to your question, but I would advise against using custom DNS names in front of the ElastiCache DNS endpoint. Improper caching may keep the records for longer than ideal and cause trouble in case of scaling or even failover. Although not common, node IPs may change in exceptional conditions during failovers. The same applies to client-side DNS caching, at the operating system or JVM level.
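For the JVM-level caching mentioned above, one possible approach is to lower the JVM's DNS cache TTL through the standard networkaddress.cache.ttl security property. The sketch below assumes it runs before the first host name lookup, since the JVM reads these properties once; the class name DnsCacheTtl and the 5-second value are illustrative assumptions, not settings taken from this thread.

```java
import java.security.Security;

public class DnsCacheTtl {
    /** Sets the JVM-wide DNS cache TTLs; call this before any host name is resolved. */
    public static void apply(int positiveTtlSeconds) {
        // Cache successful lookups for only a few seconds so changed IPs are picked up.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(positiveTtlSeconds));
        // Do not cache failed lookups at all.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        apply(5); // illustrative value
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JVM's java.security file, which avoids depending on call order in application code.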

The following blog post provides additional best practices for Redis clients, and provides examples for Lettuce: https://aws.amazon.com/blogs/database/best-practices-redis-clients-and-amazon-elasticache-for-redis/

I hope my response has been helpful to you.

AWS
Support Engineer
Answered 3 years ago
