Best Practices for ElasticSearch Cluster Failovers

Question

Hello All,

A customer is currently using AWS ElasticSearch to run the primary search function on their e-commerce website. Their queries often run for extended periods of time, which puts pressure on the ES instances and causes them to crash and reboot.

This takes their website's search functionality down until the AWS ElasticSearch service reboots the nodes. They are currently working on reducing the query times and have already been in contact with Premium Support.

Ideally, I would like to suggest alternatives or failover solutions that they could implement until they are able to reduce the heavy query requests they receive. I was wondering if the [Cross-Cluster][1] functionality could be used as a backup option? Or perhaps Route 53 health checks could be implemented as another solution?

Either way, any feedback or input would be greatly appreciated!

[1]: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/cross-cluster-search.html#cross-cluster-search-set-up-connection

Accepted Answer

It sounds like the customer is already addressing the root cause of the problem (long queries), so I would suggest the following improvements/additions (if not already in place):

1. Query caching. Put Redis on ElastiCache in front of Elasticsearch to cache query results. This can be as simple as base64-encoding the full JSON query object to use as the key, with the results as the value. Redis can expire cached objects as appropriate for the query validity (even a TTL of only 30 seconds can help enormously on a high-traffic e-commerce site).
2. Scale ES nodes vertically. ES loves memory and big queries love CPU. Not sure what their cluster looks like, but it sounds like fewer, larger nodes could help.
3. Rather than cross-cluster search, I'd suggest having a hot standby if they really can't solve the root problem (and caching doesn't help). Route 53 could be used to switch over to the hot standby. But this is an expensive option, obviously, and it should be unnecessary if they right-size their cluster and resolve the query size issues. It feels like they may also have sub-optimal index patterns, document formats, etc.?
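
To make the caching idea in point 1 a bit more concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: `cached_search`, `cache_key`, the `es:` key prefix and the `search_fn` callable are hypothetical, and `InMemoryTTLCache` is only a local stand-in so the example runs without a Redis server. It exposes the same `get` and `setex` methods that a redis-py client offers, so a real client pointed at the ElastiCache endpoint could probably be passed in its place.

```python
import base64
import json
import time


class InMemoryTTLCache:
    """Local stand-in for a Redis client (assumption: only get/setex are needed)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry has outlived its TTL, so drop it and report a miss.
            del self._store[key]
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (time.monotonic() + ttl_seconds, value)


def cache_key(query):
    # Serialize with sorted keys so logically identical queries share one key,
    # then base64-encode the JSON to get a plain ASCII key.
    canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return "es:" + base64.b64encode(canonical.encode("utf-8")).decode("ascii")


def cached_search(cache, search_fn, query, ttl_seconds=30):
    """Return a cached result if present, otherwise call search_fn and cache it."""
    key = cache_key(query)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = search_fn(query)  # hypothetical callable that queries Elasticsearch
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result
```

Sorting the keys before encoding is a small addition to the plain base64 approach, so that two requests with the same fields in a different order hit the same cache entry. For very large query bodies, hashing the canonical JSON instead of base64-encoding it would keep keys short, at the cost of no longer being able to decode the key back into the query.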
