
AWS Glue issues pushing data into OpenSearch (works with less data, fails with ~20 GB of data)


I have a Glue job that is not able to push data from a single DataFrame (or DynamicFrame) into OpenSearch.

The Glue Job fails for a number of different reasons every time I run it. The most common reason I'm seeing in the logs is:

Exception raised in execute_job_submissions. exception: An error occurred while calling o3454.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 243 (rdd at EsSparkSQL.scala:103) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 37 partition 1

Running the exact same job, without changing a thing, sometimes also fails with:

Exception raised in execute_job_submissions. exception: An error occurred while calling o3453.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 244.0 failed 4 times, most recent failure: Lost task 8.3 in stage 244.0 (TID 115763) (<IP ADDRESS> executor 27): java.net.SocketException: Connection reset

I've tried writing the data in two different ways, both using the same common opensearch_connection properties:

connection_options = {
    # Connection properties
    "connectionName": "opensearch-spark",
    "path": index_path,
    "es.nodes": uri,
    "es.port": "443",
    "es.http.timeout": "120m",
    "es.transport.compression": "true",
    "es.nodes.wan.only": "true",

    # Authentication
    "es.net.http.auth.user": credentials.username,
    "es.net.http.auth.pass": credentials.password,

    # Batch configuration
    "es.batch.size.entries": "10",
    "es.batch.size.bytes": "1mb",
    "es.batch.write.refresh": "false",
    "es.batch.write.retry.count": "5",
    "es.batch.write.retry.wait": "30s",
    "es.batch.write.retry.policy": "simple",
    "es.batch.write.retry.limit": "3",

    # Misc
    "es.mapping.id": "id",
}

The first of the two approaches writes a DynamicFrame via:

def write(self, dynamic_frame: DynamicFrame):
    self.glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="marketplace.spark",
        connection_options=self.connection_options,
    )

The second calls save() on the original DataFrame directly:

def direct_write(self, dataframe: DataFrame):
    (
        dataframe
        .write
        .format("org.elasticsearch.spark.sql")
        .options(**self.connection_options)
        .mode("overwrite")
        .save()
    )

Both seem to result in the same outcome.

I have attempted to repartition the data just before sending it to OpenSearch (sketched below), but that didn't result in improvements either. I know I'm probably missing something fundamental here, and I apologise, this isn't my tech stack or area of knowledge. It's just something I've inherited and need to resolve.
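Roughly, that repartition attempt looked like the following. This is a sketch: the partition count of 200 is an illustrative value, not the one actually used, and repartitioned_write is just a name for this example.

def repartitioned_write(self, dataframe: DataFrame):
    # repartition() redistributes the data across a fixed number of
    # partitions (which is itself a full shuffle of the dataset);
    # 200 here is an illustrative value only.
    self.direct_write(dataframe.repartition(200))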

I don't think this is a resource issue, as we've attempted running the Glue job with over 400 DPUs (expensive, we know) against a similarly sized OpenSearch cluster.

It's interesting to note that this works fine in lower environments where we have less data: when pushing 1-5 GB I don't have any issues, and the job runs quickly and completes with all the data in OpenSearch.

I'd really appreciate some help in either understanding how I can identify the problem, or how I can quickly fix it.

Asked 2 years ago · 1,282 views
1 Answer

The error mentions ShuffleMapStage, which means there is a shuffle and one or more nodes are probably dying, likely from running out of disk space. Try to avoid that shuffle and/or use fewer but larger nodes (4X/8X).
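For example, a shuffle boundary shows up as an Exchange node in the physical plan, and an explicit repartition() right before the write adds one. A minimal sketch, assuming Spark 3.x (Glue 3.0+) and reusing the dataframe and connection_options names from the question; the partition count of 100 is illustrative, not a recommendation:

# Print the physical plan and look for Exchange operators; the failing
# ShuffleMapStage corresponds to one of these shuffle boundaries.
dataframe.explain(mode="formatted")

# coalesce() merges existing partitions without a full shuffle, unlike
# repartition(), so it avoids adding a shuffle just before the write.
(
    dataframe
    .coalesce(100)
    .write
    .format("org.elasticsearch.spark.sql")
    .options(**connection_options)
    .mode("overwrite")
    .save()
)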

AWS
Expert
Answered 2 years ago
  • We're already using the 8X nodes. Can you point me to how I can confirm the disk-space theory? I'm finding it hard to get metrics out of CloudWatch that actually tell me anything useful.

  • The worker node logs should have some message about running out of disk; maybe reduce the number of workers so it's easier to check them all.
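One way to scan for that (not from the thread, just a sketch) is to filter the Glue error log group for disk-exhaustion messages with boto3. The log group /aws-glue/jobs/error is the Glue default for error output and may differ depending on the job's logging configuration, and "No space left on device" is an assumed filter string.

import boto3

# Search the Glue error log group for "no space left" messages from any worker.
# Adjust the log group name and region to match your job configuration;
# for large log groups you would also paginate over nextToken.
logs = boto3.client("logs")
response = logs.filter_log_events(
    logGroupName="/aws-glue/jobs/error",
    filterPattern='"No space left on device"',
)
for event in response.get("events", []):
    print(event["logStreamName"], event["message"][:200])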
