Glue job fails with multiple workers

0

Hi team,

I have a Glue job that reads CSV files from S3 and loads them into a MySQL RDS instance. I set up **2 workers** on my Glue job,

but the job fails for one big file with the following error:

An error occurred while calling o122.pyWriteDynamicFrame. Duplicate entry '123456' for key 'mySqlTable.PRIMARY'

I checked the CSV file in question and there is only a single line with the value '123456', so there is no duplication in the file.

Is that error because the 2 workers try to insert the same data at the same time, so that it generates the duplicate key error?

I tried with 4 workers => same error

I wanted to try with 1 worker, but it seems the minimum number of workers has to be 2 or greater.

I truncate my SQL table every time and make sure it's empty before each run.
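
For context, the script is a standard generated Glue job: read the CSV from S3, apply a direct column mapping, and write to MySQL through a JDBC connection, roughly like the sketch below (bucket, connection, database, table, and column names are placeholders, not the real ones):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV files from S3 (path is a placeholder)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Direct column mapping, no joins or aggregations (column names are placeholders)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "int"), ("name", "string", "name", "string")],
)

# Write to MySQL RDS through a Glue JDBC connection (connection and table names are placeholders)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="my-mysql-connection",
    connection_options={"dbtable": "mySqlTable", "database": "mydb"},
)

job.commit()
```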

thank you.

2 Answers
0

Is the data already present in the MySQL table?

AWS Glue is based on Spark. As you know, 1 node is the Spark driver and the other nodes host the executors (doing the real work), so in your case with just 2 workers you have only 1 executor in your cluster.

Even with multiple executors, Spark partitions the data without replicating it. This means that even parallel writes will never try to insert the same data twice, unless, as already mentioned, your code has introduced the duplicates.

Glue writes in append mode only, no updates. Have you checked whether that key is, by any chance, already present in the target database?
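
As a quick sanity check you can also count key occurrences on the Spark side right before the write, for example with a rough sketch like the one below (`mapped` is assumed to be the DynamicFrame being written and `id` the primary-key column):

```python
from pyspark.sql import functions as F

# Convert the DynamicFrame to a Spark DataFrame and count how often each
# primary-key value appears before the JDBC write ("mapped" and "id" are
# assumed names for the frame and the key column).
df = mapped.toDF()
dupes = df.groupBy("id").count().filter(F.col("count") > 1)

dupes.show(20, truncate=False)           # any rows here would explain the duplicate-key error
print("duplicate keys:", dupes.count())
```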

Hope this helps,

AWS
EXPERT
answered 2 years ago
  • I truncate my SQL table every time and make sure it's empty before each run. I tried with 2 workers and 4 workers, always the same error. The previous file passes successfully; this file always causes this error from Glue.

  • @Jess could you please post an anonymized snippet of the generated code? I have not experienced that and am not sure how to reproduce it.

0

Are you doing some transformation or join which could produce duplicate data? Is it happening for all keys or a single key only? I would suggest removing the PK or unique constraint and letting the job complete once. Then you can check in RDS whether it is a true duplicate or whether some transformation is generating duplicates.
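
Once the constraint is removed and the job completes, you can inspect what actually landed in RDS, for example by reading the table back with Spark and filtering on the offending key. A sketch, where the JDBC URL, credentials, and table/column names are placeholders:

```python
# Read the target table back from MySQL and look at the rows sharing the key,
# to see whether they are identical rows or differ in other columns.
# (JDBC URL, credentials, and table/column names are placeholders; "spark" is
# the Glue job's SparkSession, i.e. glueContext.spark_session.)
loaded = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-rds-endpoint:3306/mydb")
    .option("dbtable", "mySqlTable")
    .option("user", "admin")
    .option("password", "********")
    .load()
)

loaded.filter(loaded["id"] == "123456").show(truncate=False)
```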

AWS
Zahid
answered 2 years ago
  • No, it's a direct mapping, no transformation nor join. It happens on a single key each run.
