Intermittent performance issues with the OpenSearch cluster


We are having intermittent performance issues with our OpenSearch cluster. There are multiple daily occurrences of long-running OpenSearch queries when executing our Lambda function. When this occurs, there is a record like the example below in the OpenSearch es-application-logs:

2022-09-09T11:37:35.083-04:00 [2022-09-09T08:37:34,853][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [44029ms] ago, timed out [24016ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5} {ekzv2YqtT8q7b-TVvXGKEQ} {aGQCli5rTFKAeGjWRihvTQ} {IP} {IP} {dimr} {dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, _AMAZON_INTERNAL, cross_cluster_transport_address=IP, shard_indexing_pressure_enabled=true, _AMAZON_INTERNAL} ], id [3986058]

Normally these queries complete with sub-second response times. The impact to our application is that requests to API Gateway occasionally time out due to the long-running Lambda, which results in a 504 Gateway Timeout being returned to the API consumer. Here is an example of the OpenSearch HTTP request:

{ "method": "POST", "hostname": "vpc-tycc-analytics-dev-os-qs4p2rpyars2oacibpdih4ctmu.ca-central-1.es.amazonaws.com", "query": { "size": "500", "_source": [ "CurrentAgentSnapshot" ], "sort": "CurrentAgentSnapshot.Configuration.Username:asc" }, "headers": { "Content-Type": "application/json", "host": "vpc-tycc-analytics-dev-os-qs4p2rpyars2oacibpdih4ctmu.ca-central-1.es.amazonaws.com" }, "body": "{"query":{"bool":{"must_not":{"term":{"CurrentAgentSnapshot.Configuration.RoutingProfile.Name":"Basic Routing Profile"}}}}}", "protocol": "https:", "path": "/agent-records-real-time-alias/_search" }

Could you please investigate what is causing the performance issue and recommend remediation actions?

Let me know if you require additional information.

Thanks,

By the way, the following is our cluster specification:

• Deployment type: Dev/Test
• OS version: 1.2 (latest)
• Availability Zones: 1 AZ
• Instance type: t3.small.search
• Number of nodes: 2
• Storage type: EBS
• EBS volume type: General Purpose (SSD) - gp2
• EBS storage size per node: 10 GB

1 Answer

Hi,

With the limited info available:

  1. Look for more log lines before and after the one you shared in the post. There could be messages related to garbage collection, nodes getting disconnected, etc.
  2. Check CloudWatch metrics such as cluster status, free storage space, indexing rate, search rate, CPU utilization, JVM memory pressure, and memory utilization around the incident time for any deviation.
  3. Do you have too many shards and indices? (A quick way to list them is sketched after this list.) You can check the best practices here - https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html#bp-sharding-strategy
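
To make #3 concrete, here is a minimal sketch (Python with opensearch-py; "<domain-endpoint>" is a placeholder and authentication is omitted) that lists indices and shards so you can count them against those best practices:

    # Sketch: list indices and shards to check their counts against the
    # sharding best practices. "<domain-endpoint>" is a placeholder; auth omitted.
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "<domain-endpoint>", "port": 443}], use_ssl=True)
    print(client.cat.indices(v=True))  # one row per index: health, docs, store size
    print(client.cat.shards(v=True))   # one row per shard: index, shard, size, node

Note that t3.small.search nodes have a small JVM heap, so a high shard count can drive up GC pressure quickly.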

--Syd

Syd
answered a year ago
  • Many thanks. Please find below the logs for the first part of your answer:

    2022-10-10T18:19:51.482-04:00 [2022-10-10T15:19:50,699][WARN ][o.o.m.j.JvmGcMonitorService] [09122640e72c461b2c57179fcc08d339] [gc][3851910] overhead, spent [815ms] collecting in the last [1.1s]

    2022-10-10T18:20:53.519-04:00 [2022-10-10T15:20:52,911][WARN ][o.o.m.j.JvmGcMonitorService] [09122640e72c461b2c57179fcc08d339] [gc][3851943] overhead, spent [29.4s] collecting in the last [30.1s]

    2022-10-10T22:18:23.563-04:00 [2022-10-10T19:18:23,014][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [54249ms] ago, timed out [34233ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5}{ekzv2YqtT8q7b-TVvXGKEQ}{vBbzFjeDTymDJ7NEPqJBLg}{IP}{IP}{dimr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, AMAZON_INTERNAL, cross_cluster_transport_address=IP, shard_indexing_pressure_enabled=true, AMAZON_INTERNAL}], id [13800571]

    2022-10-10T22:18:23.566-04:00 [2022-10-10T19:18:23,018][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [33232ms] ago, timed out [13207ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5}{ekzv2YqtT8q7b-TVvXGKEQ}{vBbzFjeDTymDJ7NEPqJBLg}{IP

  • [gc][3851910] overhead, spent [815ms] collecting in the last [1.1s] and [gc][3851943] overhead, spent [29.4s] collecting in the last [30.1s]: the node seems to be spending considerable time in GC, which is not good, but there is also a 4-hour gap between the GC log events and the timeout log event. Can you corroborate this by checking the CloudWatch metrics as mentioned in #2 of my initial answer? A minimal sketch of that check follows below.
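
    For example (Python with boto3; the DomainName value, account ID, and time window are placeholders inferred from the post):

        # Sketch: pull JVMMemoryPressure around the GC warnings to corroborate them.
        # DomainName, ClientId (account ID), and the time window are placeholders.
        from datetime import datetime, timezone

        import boto3

        cloudwatch = boto3.client("cloudwatch", region_name="ca-central-1")
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ES",  # OpenSearch Service domains publish metrics under AWS/ES
            MetricName="JVMMemoryPressure",
            Dimensions=[
                {"Name": "DomainName", "Value": "tycc-analytics-dev-os"},
                {"Name": "ClientId", "Value": "<account-id>"},
            ],
            StartTime=datetime(2022, 10, 10, 21, 0, tzinfo=timezone.utc),
            EndTime=datetime(2022, 10, 11, 3, 0, tzinfo=timezone.utc),
            Period=60,
            Statistics=["Maximum"],
        )
        for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
            print(point["Timestamp"], point["Maximum"])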
