By using AWS re:Post, you agree to the Terms of Use

aws glue job fail for escaper char

0

Hi team,

I tried to load a big CSV file from s3 to RDS MySQL using AWS glue,

I have an escaper character on the file (special character).

This escape char is also defined on the crawled CSV table.

each time the job fails with an error :

An error occurred while calling o122.pyWriteDynamicFrame. Duplicate entry '123456' for key 'MySQL table.PRIMARY'

I don't have any duplicate keys on my file and the table is truncated each time before running the job.

I tried to narrow down the issue by dividing the file into chunks each chank runs successfully, but the whole file in a single job always fails with the above error.

I divided the whole file into chunks,** every chunk runs successfully and I get the full data.**

I can't figure out why ? is this a glue issue or a data issue ..?

I know the issue is related to my escape character because when I removed them the whole file passed OR when I replace my special character escape with "" the whole file passes also. is that because glue doesn't support certain escape characters (I have this issue with big files)

Not sure why the whole file with the escaper fails and when we run it in sub-chunk every chunk passes?

any idea?

glue script:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_csv", table_name = "tbl_csvxx", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "string", "id", "string"), ("col1", "string", "col1", "string"), ("date1", "string", "date2", "timestamp"), ("col2", "string", "col2", "string"), ("col3", "string", "col3", "string"), ("col4", "string", "col24", "string"), ("col5", "string", "col5", "string"),...], transformation_ctx = "applymapping1")

selectfields2 = SelectFields.apply(frame = applymapping1, paths = [ "col1", "col2", "col3", "id","col4", "col5",...], transformation_ctx = "selectfields2")

datasink3 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = selectfields2, catalog_connection = conn_name, connection_options = {"dbtable": "mysqltable", "database": db_name}, transformation_ctx = "datasink3")

sample data :

"123","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","addr1 ®" addr2®" addr3®"",,,"22","city1","121",,,,,"CC"

"456","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","sds, dssdds®"F®", sds sds, dwr, re2",,,"ree364","ABD","288",,,,,"N"

"789","2018-02-09 12:16:38.000","2018-02-09 12:16:38.000","Alle# 02, Sept# 06, sdsx,",,"SAP# ®"C®"","DPPK# 05","dssd","313","Alkl",,,"1547","P"

Thank you.

1 Answer
0

Hello,

I understand you have special character escaper in csv file due to which you are facing issue but when you change special character to "" it worked without any issue.

There are certain escaper character which is not supported (\n, \r, \t, and \0) mentioned in documentation as well.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

If your file contains a special escaper character which is supported, it should work without any issue. If you still face any issue, we require your sample data and script. Hence Please feel free to reach out to AWS Premium Support with sample data and script and we will be happy to help.

answered 5 months ago
  • I updated the description to add the script and sample date

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions