OpenSearch Ingestion Services (OSIS) Buffer Overflow and Performance Issues

Hello AWS Community,

I'm reaching out to share an issue we've recently experienced with AWS OpenSearch Ingestion Services (OSIS) and to seek your guidance and advice.

We have started to implement OSIS as part of our data pipeline. During one of our tests, after OSIS had been running successfully for around 5 hours, we began to observe buffer overflow drops and an unacceptable increase in OpenSearch (OS) cluster search latency. Concurrently, we noticed the following (a sketch of how we pull these metrics from CloudWatch is included after the list):

  • An unexpected peak in OSIS buffer usage.
  • A decrease in SQS message processing speed.
  • A general decrease in OSIS performance.
  • An increase in OpenSearch cluster indexing latency.
  • An increase in OS cluster search latency.
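
For what it's worth, this is roughly how we pull the buffer figures above out of CloudWatch. It is only a sketch: the metric name and the PipelineName dimension below are our best guess at the naming OSIS uses in the AWS/OSIS namespace, and "my-osis-pipeline" is a placeholder, so the metrics your own pipeline actually emits may differ.

# Sketch: pull an OSIS buffer metric from CloudWatch with boto3.
# The metric name and dimension are assumptions based on what we see in our
# account under the AWS/OSIS namespace; adjust to your pipeline's metrics.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/OSIS",
    MetricName="main-ingestion-pipeline.buffer.bufferUsage",  # assumed name
    Dimensions=[{"Name": "PipelineName", "Value": "my-osis-pipeline"}],  # placeholder
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])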

These issues seemed to correlate with a slight increase in search activity on our OS cluster, alongside a drop, then a steady rise in the count of deleted documents (we are overwriting some of the documents).
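
For context, this is roughly how we track the deleted-document counts mentioned above. Again just a sketch, using the opensearch-py client; the endpoint below is a placeholder for our domain.

# Sketch: watch live vs. deleted document counts per index (opensearch-py).
# The domain endpoint is a placeholder; credentials come from the default
# boto3 credential chain.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "es")

client = OpenSearch(
    hosts=[{"host": "search-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# _cat/indices exposes docs.count and docs.deleted, which is where we see
# the churn caused by overwriting documents with the same _id.
for row in client.cat.indices(format="json", h="index,docs.count,docs.deleted"):
    print(row["index"], row["docs.count"], row["docs.deleted"])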

Our data pipeline is designed to ingest millions of records monthly using OSIS. The speed of data ingestion is not our primary concern, as we focus on ensuring the data is ingested without causing excessive load on our OS cluster. To optimize our configuration, we've tweaked the following parameters:

  • The number of records in each S3 object.
  • The "records to accumulate" parameter in the OSIS pipeline configuration.
  • The "poll_delay" parameter in the OSIS pipeline configuration.

We're operating with a single OSIS OCU to manage costs, and our primary goal is to avoid any significant increase in OS cluster search latency, which directly affects our users' experience.

Below are some additional details about our setup:

  • We're ingesting data from S3 via SQS notifications, with an average compressed (GZIP) object size of 4.2 MB and an average uncompressed JSONL file size of 26.0 MB.
  • Each JSONL file contains approximately 250,000 rows.
  • Example row:
{"id":"0310116856_20220117","date":"2022-01-17","val":9.46,"sink_mapping_k
ey":"index1","id":"0310116856"}
  • During the ingestion, the intake processing queue message count started at around 900 and dropped to about 600 over the course of 5 hours, before the issues began. While those files were being processed successfully, no issues occurred, and both OSIS and the OS cluster performed within the given thresholds.
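
To put those numbers in perspective, some quick back-of-the-envelope math (plain Python, figures taken from above; treating records_to_accumulate as "documents per buffer flush" is our own reading of that setting):

# Rough arithmetic on the figures above. The mapping of records_to_accumulate
# to "documents per flush" is our assumption, not a documented fact.
rows_per_file = 250_000
records_to_accumulate = 50
files_processed = 900 - 600   # queue depth went from ~900 to ~600 messages
hours = 5

flushes_per_file = rows_per_file / records_to_accumulate
docs_per_hour = files_processed * rows_per_file / hours
docs_per_second = docs_per_hour / 3600

print(f"~{flushes_per_file:,.0f} flushes per file")       # ~5,000
print(f"~{docs_per_hour:,.0f} documents per hour")        # ~15,000,000
print(f"~{docs_per_second:,.0f} documents per second")    # ~4,167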

Details of our OSIS pipeline configuration (redacted and simplified):

{
    "version": "2",
    "main-ingestion-pipeline":
    {
        "source":
        {
            "s3":
            {
                "records_to_accumulate": 50,
                "notification_type": "sqs",
                "compression": "gzip",
                "acknowledgments": true,
                "codec":
                {
                    "newline":
                    {}
                },
                "sqs":
                {
                    "poll_delay": "60s",
                    "queue_url": "https://sqs.us-east-1.amazonaws.com/.../queue_url",
                    "maximum_messages": 1,
                    "visibility_timeout": "900s"
                },
                "aws":
                {
                    "region": "us-east-1",
                    "sts_role_arn": "arn:aws:iam::...:role/OSISAccessRole"
                }
            }
        },
        "processor": [
        {
            "parse_json":
            {}
        },
        {
            "delete_entries":
            {
                "with_keys": ["message", "s3"]
            }
        },
        {
            "rename_keys":
            {
                "entries": [
	                {
	                    "from_key": "snake_case",
	                    "to_key": "snakeCase",
	                    "overwrite_if_to_key_exists": true
	                },
            	]
            }
        }],
        "route": [
        {
            "index1": "/sink_mapping_key == \"index1\""
        },
        {
            "index2": "/sink_mapping_key == \"index2\""
        }],
        "sink": [
        {
            "opensearch":
            {
                "routes": ["index1"],
                "hosts": ["https://search-domain.us-east-1.es.amazonaws.com"],
                "aws":
                {
                    "aws_sigv4": true,
                    "region": "us-east-1",
                    "sts_role_arn": "arn:aws:iam::...:role/OSISAccessRole"
                },
                "index": "index1",
                "document_id_field": "id",
                "max_retries": 16,
                "dlq":
                {
                    "s3":
                    {
                        "bucket": "my-dlq-s3-bucket",
                        "key_path_prefix": "dlq-files/index1/",
                        "region": "us-east-1",
                        "sts_role_arn": "arn:aws:iam::...:role/OSISAccessRole"
                    }
                }
            }
        },
        {
            "opensearch":
            {
                "routes": ["index2"],
                "hosts": ["https://search-domain.us-east-1.es.amazonaws.com"],
                "aws":
                {
                    "aws_sigv4": true,
                    "region": "us-east-1",
                    "sts_role_arn": "arn:aws:iam::...:role/OSISAccessRole"
                },
                "index": "index2",
                "document_id_field": "id",
                "max_retries": 16,
                "dlq":
                {
                    "s3":
                    {
                        "bucket": "my-dlq-s3-bucket",
                        "key_path_prefix": "dlq-files/index2/",
                        "region": "us-east-1",
                        "sts_role_arn": "arn:aws:iam::...:role/OSISAccessRole"
                    }
                }
            }
        }]
    }
}

OS Cluster:

Dedicated master nodes
Enabled: Yes
Instance type: m5.large.search
Number of nodes: 3
UltraWarm data nodes enabled: No

Data nodes
Availability Zone(s): 1-AZ without standby
Instance type: m5.large.search
Number of nodes: 2
Storage type: EBS General Purpose (SSD) - gp2
EBS volume size: 200 GiB

Given this information, I am looking for guidance on what may have caused these performance issues and advice on how we can optimize our OSIS configuration and overall system to ensure smooth data ingestion and minimize search latency.

Any insights, advice, or recommendations from the community would be highly appreciated.

Thank you in advance.

Kasper
