
AWS Glue issues pushing data into OpenSearch (works with less data, fails with ~20GB of data)


I have a Glue job which is not able to push data from a single DataFrame (or DynamicFrame) into OpenSearch.

The Glue job fails for a number of different reasons across runs. The most common one I'm seeing in the logs is:

Exception raised in execute_job_submissions. exception: An error occurred while calling o3454.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 243 (rdd at EsSparkSQL.scala:103) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 37 partition 1

Running the exact same job, without changing a thing, sometimes also fails with:

Exception raised in execute_job_submissions. exception: An error occurred while calling o3453.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 244.0 failed 4 times, most recent failure: Lost task 8.3 in stage 244.0 (TID 115763) (<IP ADDRESS> executor 27): java.net.SocketException: Connection reset

I've tried writing the data in two different ways, both using the same common OpenSearch connection properties:

connection_options = {
    # Connection properties
    "connectionName": "opensearch-spark",
    "path": index_path,
    "es.nodes": uri,
    "es.port": "443",
    "es.http.timeout": "120m",
    "es.transport.compression": "true",
    "es.nodes.wan.only": "true",

    # Authentication
    "es.net.http.auth.user": credentials.username,
    "es.net.http.auth.pass": credentials.password,

    # Batch configuration
    "es.batch.size.entries": "10",
    "es.batch.size.bytes": "1mb",
    "es.batch.write.refresh": "false",
    "es.batch.write.retry.count": "5",
    "es.batch.write.retry.wait": "30s",
    "es.batch.write.retry.policy": "simple",
    "es.batch.write.retry.limit": "3",

    # Misc
    "es.mapping.id": "id",
}

The first way I've tried to write this is using a DynamicFrame via:

def write(self, dynamic_frame: DynamicFrame):
    self.glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="marketplace.spark",
        connection_options=self.connection_options,
    )

And the second is calling save() directly on the original DataFrame:

def direct_write(self, dataframe: DataFrame):
    (
        dataframe
        .write
        .format("org.elasticsearch.spark.sql")
        .options(**self.connection_options)
        .mode("overwrite")
        .save()
    )

Both seem to result in the same outcome.

I have attempted to repartition the data just before sending it to OpenSearch (roughly along the lines of the sketch below), but that didn't result in any improvement either. I know I'm probably missing something fundamental here, and I apologise, as this isn't my tech stack or area of knowledge; it's just something I've inherited and need to resolve.
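
For illustration, a minimal sketch of that repartitioning attempt (the partition counts are placeholders I experimented with, not recommendations):

    # Sketch only: repartition() introduces another full shuffle, while coalesce()
    # merges existing partitions without one, which may matter given the shuffle failures.
    repartitioned = dataframe.repartition(200)   # placeholder partition count
    coalesced = dataframe.coalesce(64)           # placeholder partition count

    self.direct_write(coalesced)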

I don't think this is a resource issue, as we've attempted running the Glue job with over 400 DPUs (expensive, we know) and a similarly sized OpenSearch cluster.

It's interesting to note that this works fine in lower environments where we have less data: when pushing 1-5GB I don't have any issues, and the job runs quickly and completes with all the data in OpenSearch.

I'd really appreciate some help in either understanding how I can identify the problem, or how I can quickly fix this.

asked 2 years ago · 1.3k views
1 Answer

The error mentions ShuffleMapStage, which means there is a shuffle and probably one or more nodes are dying, likely from running out of disk space. Try to avoid that shuffle and/or use fewer but larger workers (4X/8X).
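
As a rough, hedged illustration of that idea (not part of the original answer; the values and the writer call are placeholders), two common ways to take pressure off the shuffle in a Glue/Spark job are to reduce the number of shuffle partitions and to materialise the DataFrame before the OpenSearch write:

    from pyspark import StorageLevel

    # Placeholder value; tune for the actual data volume and worker count.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    # Materialise the upstream work so a retried write task does not
    # re-trigger the failing shuffle stages.
    dataframe = dataframe.persist(StorageLevel.DISK_ONLY)
    dataframe.count()  # force evaluation before writing

    writer.direct_write(dataframe)  # hypothetical writer instance from the question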

AWS
EXPERT
answered 2 years ago
  • We're already using the 8X workers. Can you point me to how I can confirm the disk-space theory? I'm finding it hard to get metrics out of CloudWatch that actually tell me anything useful.

  • The worker node logs should have a message about running out of disk; maybe reduce the number of workers so it's easier to check them all (one way to pull the disk metric from CloudWatch is sketched below these comments).
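
One hedged way to sanity-check executor disk usage, assuming job metrics are enabled for the Glue job and that the metric and dimension names below match what your account actually emits under the Glue namespace in CloudWatch (the job name is a placeholder):

    import boto3
    from datetime import datetime, timedelta

    # Assumes the Glue job was run with metrics enabled; verify the exact metric and
    # dimension names in the CloudWatch console ("Glue" namespace) before relying on this.
    cloudwatch = boto3.client("cloudwatch")

    response = cloudwatch.get_metric_statistics(
        Namespace="Glue",
        MetricName="glue.ALL.BlockManager.disk.diskSpaceUsed_MB",  # aggregate executor disk usage
        Dimensions=[
            {"Name": "JobName", "Value": "my-opensearch-load-job"},  # placeholder job name
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=6),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Maximum"], "MB")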

