AWS Neptune automated failover does not happen on non-responding endpoints


Hello, we have a small AWS Neptune cluster with 2 nodes: one Writer and one Reader. Today we found many timeout errors in one of the consuming services that sends Gremlin queries to the cluster. Analysis showed that the Reader responded normally when addressed through its own database/instance/node endpoint, while even the simplest request to the Writer ran into a timeout, whether it was sent through the database/instance/node endpoint, the cluster Writer endpoint, the cluster Reader endpoint, or a custom endpoint with both nodes as members. The timeout occurred with the default period and even when we increased the query timeout to several hours. It looked as if the endpoint service was somehow disconnected from the underlying database.

In the end, a manual failover resolved the problem, and we ran a second failover to confirm that the previously defective node was working again.

We know that an automated failover will happen for, e.g., OutOfMemory errors. Is there a built-in way to ensure that Neptune fails over automatically if requests to an endpoint never reach the database itself, are not processed at all, and just idle forever (that is, until some timeout is reached)?

Patrick
Asked 4 months ago, 249 views
1 Answer

TimeOut exceptions (and OutOfMemory exceptions, for that matter) are not the product of instance availability; they result from issues with the execution of a query.

TimeOut exceptions occur when the execution thread for a given query cannot complete the query within the defined query timeout period (2 minutes, by default). This typically happens when a query follows a more analytical pattern. Neptune is designed for highly concurrent workloads with transactional (OLTP) style queries, meaning queries that are constrained to a small portion of the graph. If you need to execute analytical (OLAP) queries, you may want to investigate the newer Neptune Analytics functionality.
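Because the query timeout described above is enforced on the server's execution thread, a consuming service may also want its own client-side deadline so that a request that never returns is noticed quickly. The following is only a minimal sketch using the Python standard library; `run_with_deadline`, `fast_query`, and `stuck_query` are hypothetical names for illustration and are not part of any Neptune or Gremlin client API.

```python
import concurrent.futures
import time


def run_with_deadline(fn, deadline_seconds):
    """Run fn() in a worker thread and wait at most deadline_seconds for it.

    Raises concurrent.futures.TimeoutError if no result arrives in time.
    The worker thread is not killed; the caller simply stops waiting.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=deadline_seconds)
    finally:
        # Do not block on a call that may never return.
        executor.shutdown(wait=False)


def fast_query():
    # Hypothetical stand-in for a query that returns promptly.
    return "ok"


def stuck_query():
    # Hypothetical stand-in for a request that idles far too long.
    time.sleep(0.5)
    return "late"
```

In a real service, `fn` would wrap the actual query call, and a timeout here would be a signal to alert or retry against another endpoint. It does not trigger a failover by itself.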

OutOfMemory exceptions are not indicative of an instance running out of memory. OOM exceptions occur when the execution thread for a query runs out of the memory allocated to that query thread. Within Neptune, we allocate approximately 2/3 of instance memory to the buffer pool cache. The other 1/3 (minus what we need to run the operating system on each instance) is divided among the query execution threads. The number of threads is equal to 2x the number of vCPUs on an instance.
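The arithmetic above can be sketched as follows. This is a rough illustration only: the function names are hypothetical, the operating system overhead is an assumed parameter (the answer does not give a figure for it), and the 2/3 split is described above as approximate.

```python
def query_thread_count(vcpus):
    """Number of query execution threads: 2x the number of vCPUs."""
    return 2 * vcpus


def approx_memory_per_query_thread_gib(instance_memory_gib, vcpus, os_overhead_gib=0.0):
    """Estimate the memory available to a single query execution thread.

    os_overhead_gib is an assumed value; the real overhead is not stated.
    """
    # Roughly 2/3 of instance memory goes to the buffer pool cache.
    buffer_pool_gib = instance_memory_gib * 2 / 3
    # The remaining 1/3, minus the OS overhead, is shared by the query threads.
    remaining_gib = instance_memory_gib - buffer_pool_gib - os_overhead_gib
    return remaining_gib / query_thread_count(vcpus)
```

For a hypothetical instance with 64 GiB of memory and 8 vCPUs, and ignoring OS overhead, this gives 16 threads with roughly 1.33 GiB each, which is why a single heavy query can hit an OOM exception while the instance as a whole still has plenty of memory.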

Encountering either of these exceptions will not cause a cluster failover, as these are not the result of infrastructure becoming unavailable.

AWS
Answered 4 months ago
