AWS Neptune automated failover does not happen on non-responding endpoints


Hello, we have a small AWS Neptune cluster with 2 nodes: one Writer and one Reader. Today we found many timeout errors in one of the consuming services that sends Gremlin queries to the cluster. Our analysis showed that the Reader responded normally when addressed through its own database/instance/node endpoint, while even the simplest request timed out when sent to the Writer's database/instance/node endpoint, the cluster Writer endpoint, the cluster Reader endpoint, or a custom endpoint with both nodes as members. The timeout occurred with the default period and even when we increased the query timeout to several hours. It looked as if the endpoint service had somehow become disconnected from the underlying database.

In the end, a manual failover resolved the problem, and a second failover confirmed that the previously defective node was working again.

We know that an automated failover will happen for, e.g., OutOfMemory errors. Is there a built-in way to ensure that Neptune fails over automatically when requests to the endpoints do not reach the database itself, are not processed at all, and just idle forever (that is, until some timeout is reached)?

Patrick
asked 4 months ago, 238 views
1 Answer

Timeout exceptions (and OutOfMemory exceptions, for that matter) are not the product of instance availability; they happen because of issues with the execution of a query.

Timeout exceptions occur when the execution thread for a given query cannot complete the query within the defined query timeout period (2 minutes, by default). This typically happens when a query follows a more analytical pattern. Neptune is designed for highly concurrent workloads with transactional (OLTP) style queries, meaning queries that are constrained to a small portion of the graph. If you need to execute analytical (OLAP) queries, you may want to investigate the newer Neptune Analytics functionality.

OutOfMemory exceptions are not indicative of an instance running out of memory. OOM exceptions occur when the execution thread for a query runs out of the memory allocated to that query thread. Within Neptune, we allocate approximately 2/3 of instance memory to the buffer pool cache. The other 1/3 (minus what we need to run the operating system on each instance) is divided among the query execution threads, and the number of threads is equal to 2x the number of vCPUs on an instance.
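As a rough illustration of the arithmetic above, the following Python sketch estimates the memory available to each query thread. The function name, the example instance size (16 GiB, 2 vCPUs), and the 1 GiB operating system overhead are assumptions chosen for illustration, not published Neptune figures, and the 2/3 split is only approximate.

```python
def per_thread_query_memory_gib(instance_memory_gib, vcpus, os_overhead_gib=0.0):
    """Rough estimate of the memory available to one query execution thread.

    Follows the description above: about 2/3 of instance memory goes to the
    buffer pool cache, and the remainder (minus OS overhead) is split across
    2 x vCPUs query threads. os_overhead_gib is an assumed value.
    """
    buffer_pool_gib = instance_memory_gib * 2 / 3
    remaining_gib = instance_memory_gib - buffer_pool_gib - os_overhead_gib
    query_threads = 2 * vcpus
    return remaining_gib / query_threads


# Hypothetical instance: 16 GiB of memory, 2 vCPUs, 1 GiB assumed for the OS.
estimate = per_thread_query_memory_gib(16, 2, os_overhead_gib=1.0)
print(f"{estimate:.2f} GiB per query thread")
```

Under these assumptions a single query thread has only about 1 GiB to work with, which is why one memory-hungry query can raise an OOM exception while the instance as a whole still has plenty of memory.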

Encountering either of these exceptions will not cause a cluster failover, as these are not the result of infrastructure becoming unavailable.

AWS
answered 4 months ago
