AWS Neptune load is not distributed evenly

We are using three readers and connect through the reader endpoint in the application, but the CPU of only one reader reaches its maximum while the other two readers stay below 40%. The load is not distributed evenly.

  • The minimum and maximum connection pool sizes we have set are 20 and 100. We have tried both gremlin-client builder classes:

    NeptuneGremlinClusterBuilder and GremlinClusterBuilder

    We are passing all the readers to the cluster, but the issue remains.

    We have seen that there is one more method that takes an endpoint as input, networkLoadBalancerEndpoint(networkLoadBalancerEndpoint), but we are not sure how to configure it or where to get this endpoint. We are still debugging and looking for a solution.

asked a year ago, 296 views
3 Answers

Can you post the code that you use for configuring the NeptuneGremlinClusterBuilder? Thank you

AWS-ISR
answered a year ago
  • GremlinCluster readCluster = NeptuneGremlinClusterBuilder.build()
            .port(neptuneProperties.getReader().getPort())
            .enableSsl(neptuneProperties.isEnableSSL())
            .maxContentLength(NeptuneConstants.NEPTUNE_MAX_CONTENT_LENGTH)
            .maxConnectionPoolSize(neptuneProperties.getReader().getMaxConnectionPoolSize())
            .minConnectionPoolSize(neptuneProperties.getReader().getMinConnectionPoolSize())
            .maxSimultaneousUsagePerConnection(neptuneProperties.getReader().getMaxSimultaneousUsagePerConnection())
            .minSimultaneousUsagePerConnection(neptuneProperties.getReader().getMinSimultaneousUsagePerConnection())
            .maxInProcessPerConnection(neptuneProperties.getReader().getMaxInProcessPerConnection())
            .minInProcessPerConnection(neptuneProperties.getReader().getMinInProcessPerConnection())
            .addContactPoints(refreshAgent.getAddresses().get(endpointsSelector))
            .create();

    GremlinClient client = readCluster.connect();

    refreshAgent.startPollingNeptuneAPI(
            (OnNewAddresses) addresses -> client.refreshEndpoints(addresses.get(endpointsSelector)),
            60,
            TimeUnit.SECONDS);

    DriverRemoteConnection connection = DriverRemoteConnection.using(client);
    GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(connection);

  • When we removed all the extra properties, as shown below, and used the defaults, the load was distributed evenly across the read replicas:

    GremlinCluster readCluster = NeptuneGremlinClusterBuilder.build()
            .port(neptuneProperties.getReader().getPort())
            .enableSsl(neptuneProperties.isEnableSSL())
            .addContactPoints(refreshAgent.getAddresses().get(endpointsSelector))
            .create();

    When we added the maxContentLength property to increase the content length, the load was no longer distributed across the replicas:

    GremlinCluster readCluster = NeptuneGremlinClusterBuilder.build()
            .port(neptuneProperties.getReader().getPort())
            .enableSsl(neptuneProperties.isEnableSSL())
            .maxContentLength(NeptuneConstants.NEPTUNE_MAX_CONTENT_LENGTH)
            .addContactPoints(refreshAgent.getAddresses().get(endpointsSelector))
            .create();

    We did not find any issue or documentation saying that this property should cause a problem, but according to our analysis it is the cause. Any suggestions would be helpful.

The reader endpoint distributes connections, not individual requests, by changing the instance it points to every 5 seconds. During each 5-second window, every new connection is directed to the same instance. A large connection pool that is opened eagerly will therefore likely attach most of its connections to a single instance. And because these are long-lived WebSocket connections, they will keep forwarding traffic to that instance for the lifetime of the connection.

More details on this, and some recommendations for mitigating it, are described here: https://aws.amazon.com/blogs/database/load-balance-graph-queries-using-the-amazon-neptune-gremlin-client/
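
To make the effect concrete, here is a minimal simulation sketch. It does not call Neptune or the Gremlin driver. It assumes a simplified model in which the reader endpoint resolves to one replica per 5-second window and rotates through the replicas in order. The class and method names (ReaderEndpointSimulation, resolve, openPool) are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

public class ReaderEndpointSimulation {

    // Assumption: the endpoint points at one replica per 5-second window.
    static final long ROTATION_MILLIS = 5_000;

    /** Index of the replica the (simulated) reader endpoint resolves to at the given time. */
    static int resolve(long timeMillis, int replicaCount) {
        return (int) ((timeMillis / ROTATION_MILLIS) % replicaCount);
    }

    /**
     * Opens poolSize connections, one every gapMillis starting at time 0,
     * and returns how many connections landed on each replica.
     */
    static int[] openPool(int poolSize, long gapMillis, int replicaCount) {
        int[] perReplica = new int[replicaCount];
        for (int i = 0; i < poolSize; i++) {
            perReplica[resolve(i * gapMillis, replicaCount)]++;
        }
        return perReplica;
    }

    public static void main(String[] args) {
        // Eager pool: 20 connections opened 10 ms apart, all inside one window.
        System.out.println("eager:     " + Arrays.toString(openPool(20, 10, 3)));   // [20, 0, 0]
        // Staggered: 21 connections opened 5 s apart, one per window.
        System.out.println("staggered: " + Arrays.toString(openPool(21, 5_000, 3))); // [7, 7, 7]
    }
}
```

In this model, a minimum pool size of 20 opened at startup puts all 20 connections on the first replica, which matches the single hot reader described in the question. Real DNS caching and connection timing will differ, so treat the numbers as illustrative only.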

AWS-ISR
answered a year ago

maxContentLength shouldn't have any impact on the endpoint selection logic – it's used purely to configure the frame size for responses from the server.

One more question: what endpointsSelector are you using? Is it an EndpointsType enum value (and if so, which one)? Or is it a custom selector? Thanks

AWS-ISR
answered a year ago
