AWS Neptune automated failover does not happen on non-responding endpoints


Hello, we have a small AWS Neptune cluster with 2 nodes: one Writer and one Reader. Today we found many timeout errors in one of the consuming services that sends Gremlin queries to the cluster. Analyzing the problem showed that the Reader responded normally when addressed through its instance endpoint, while even the simplest request to the Writer ran into a timeout, whether we used its instance endpoint, the cluster Writer endpoint, the cluster Reader endpoint, or a custom endpoint with both nodes as members. The timeout occurred with the default period and even when we increased the query timeout to several hours. It looked as if the endpoint service was somehow disconnected from the underlying database.

In the end, a manual failover resolved this, and we ran a second failover to confirm that the previously defective node was working again.

We know that an automated failover will happen for OutOfMemory errors, for example. Is there a built-in way to ensure that Neptune fails over automatically if requests sent to the endpoints never reach the database itself, are not processed at all, and just idle forever (that is, until some timeout is reached)?

Patrick
Asked 5 months ago · Viewed 253 times
1 Answer

Timeout exceptions (and OutOfMemory exceptions, for that matter) are not a product of instance availability; they arise from issues with the execution of a query.

Timeout exceptions occur when the execution thread for a given query cannot complete the query within the defined query timeout period (2 minutes, by default). This typically happens when a query follows a more analytical pattern. Neptune is designed for highly concurrent workloads with transactional (OLTP) style queries, meaning queries that are constrained to a small portion of the graph. If you need to execute analytical (OLAP) queries, you may want to investigate the newer Neptune Analytics functionality.
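To make the timeout period concrete, here is a minimal sketch of a per-query timeout, assuming the Gremlin HTTP endpoint on port 8182 and the `evaluationTimeout` option in the query string. The function name `build_gremlin_request`, the host name, and the 5000 ms value are hypothetical and chosen only for illustration. The request is built but not sent.

```python
import json
import urllib.request


def build_gremlin_request(endpoint, traversal, timeout_ms):
    """Build (but do not send) an HTTP POST for a Gremlin query.

    `endpoint` is a hypothetical host name; `traversal` is the part of the
    query that follows `g`, for example ".V().limit(1)".
    """
    # Server-side timeout for this one query, in milliseconds.
    query = f"g.with('evaluationTimeout', {int(timeout_ms)}){traversal}"
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/gremlin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint name, used only to show the call.
req = build_gremlin_request("my-cluster.example.com", ".V().limit(1)", 5000)
```

Sending it with `urllib.request.urlopen(req, timeout=10)` would also put a client-side bound on the call, so the caller stops waiting even if the server never answers.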

OutOfMemory exceptions do not indicate that an instance is running out of memory. OOM exceptions occur when the execution thread for a query runs out of the memory allocated to that query thread. Within Neptune, we allocate approximately 2/3 of instance memory to the buffer pool cache. The other 1/3 (minus what the operating system on each instance needs) is divided among the query execution threads, whose number equals 2x the number of vCPUs on the instance.
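The split described above can be worked through as a rough estimate. This is only a sketch: the function name `query_thread_memory` is hypothetical, the operating system overhead is not stated here and is passed in as an assumed value, and the 16 GiB / 2 vCPU figures are illustrative inputs rather than a statement about any particular instance class.

```python
def query_thread_memory(instance_memory_gib, vcpus, os_overhead_gib):
    """Estimate the memory available to each query execution thread.

    Applies the approximate rule from the text: 2/3 of instance memory goes
    to the buffer pool cache, and the remaining 1/3 (minus OS overhead) is
    divided among 2 x vCPUs query threads.
    """
    buffer_pool_gib = instance_memory_gib * 2 / 3
    threads = 2 * vcpus
    # What is left after the buffer pool and the assumed OS overhead.
    remaining_gib = instance_memory_gib - buffer_pool_gib - os_overhead_gib
    return {
        "buffer_pool_gib": buffer_pool_gib,
        "threads": threads,
        "per_thread_gib": remaining_gib / threads,
    }


# Illustrative inputs: 16 GiB of memory, 2 vCPUs, 1 GiB assumed for the OS.
estimate = query_thread_memory(16, 2, 1.0)
print(estimate)
```

With these inputs, a single query thread gets only a small share of the instance memory, which is why one heavy query can hit an OOM exception while the instance as a whole has memory to spare.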

Encountering either of these exceptions will not cause a cluster failover, because neither results from infrastructure becoming unavailable.

AWS
Answered 5 months ago
