Backslash in CSV with Glue


Hi team, I have an AWS Glue job that reads data from a CSV file in S3 and inserts the data into a table in a MySQL RDS Aurora DB. The escape character used in the CSV file is the backslash (\).

I have 2 issues with Glue while loading the CSV file:

1 - Let's say I have the string "Hello\John". It will be imported as "HelloJohn" (since the backslash is defined as the escape character), so data integrity is not respected.

2 - There are some lines in the CSV file where the backslash is **right before the closing double quote**. For example:

"aaa\blah blah blah\blah blah".

This one breaks the parser completely (the crawler doesn't detect the file's columns and their data types at all, and the crawl produces a blank result) because the backslash effectively escapes the closing double quote.

Is there a way to configure Glue to deal with backslashes and avoid the above 2 issues? (I already defined escapeChar = \ on the crawler.)
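For what it's worth, the same two failure modes can be reproduced outside Glue with Python's built-in csv module when it is configured with the same escape character, which I assume is roughly what the crawler's CSV parsing does (this is only an illustration, not Glue code):

import csv
import io

# Issue 1: with escapechar="\\" the parser drops the backslash,
# so "Hello\John" comes back as "HelloJohn".
row = next(csv.reader(io.StringIO('"Hello\\John"\n'), escapechar="\\"))
print(row)  # ['HelloJohn']

# Issue 2: a backslash right before the closing quote escapes that quote,
# so the field never terminates and the next line is pulled into the same record.
data = '"aaa\\blah blah blah\\blah blah\\"\n"next,row"\n'
rows = list(csv.reader(io.StringIO(data), escapechar="\\"))
print(len(rows))  # 1 -- the two input lines collapse into a single record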

1 Answer
  1. Tried reading the mentioned sample data from a CSV file using both a Glue dynamic frame and a Spark dataframe. In both cases, I observed that the special character was preserved.
# Spark dataframe read -- the backslash in the value is preserved.
df = spark.read.option("header", "true").csv("<s3-path-to-file>")
df.show()
+----------+
|      colA|
+----------+
|Hello\John|
+----------+
# Glue DynamicFrame read -- the backslash is preserved here as well.
datasource0 = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["<s3-path-to-file>"]}, format="csv", transformation_ctx="datasource0")
datasource0.toDF().show()
+----------+
|      col0|
+----------+
|Hello\John|
+----------+

So, please make sure your code is not doing any other processing before writing the data to MySQL.
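If the job is essentially a straight copy from the crawled CSV table to the crawled Aurora table, it should look roughly like the sketch below (the database and table names are placeholders for your own):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled CSV table as-is -- no mappings or string transforms in between.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_csv_table",     # placeholder
    transformation_ctx="source")

# Write straight to the crawled MySQL/Aurora table through its catalog entry.
glueContext.write_dynamic_frame.from_catalog(
    frame=source,
    database="my_database",        # placeholder
    table_name="my_aurora_table",  # placeholder
    transformation_ctx="sink")

job.commit()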

  2. Created a CSV file with the mentioned sample data and ran a crawler with a custom CSV classifier (Quote symbol: "). The crawler was able to create the table and I was able to query the data from Athena (a boto3 sketch of this classifier setup is included below).
SELECT * FROM "default"."csv" limit 10;
col0          col1          col2          col3
-------------------------------------------------
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\

Because you mentioned the crawler produced a blank result, make sure your CSV file has at least two columns and two rows of data, as per the documentation, for the crawler to be able to determine a table.
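For reference, the classifier and crawler setup from point 2 could be created programmatically roughly like this (the classifier name, IAM role, database, and S3 path below are placeholders):

import boto3

glue = boto3.client("glue")

# Custom CSV classifier with an explicit quote symbol, as described in point 2.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-double-quote",   # placeholder
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
        "DisableValueTrimming": True,
        "AllowSingleColumn": False,
    }
)

# Attach the custom classifier so the crawler tries it before the built-in ones.
glue.create_crawler(
    Name="csv-crawler",               # placeholder
    Role="GlueCrawlerRole",           # placeholder IAM role
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/path/"}]},  # placeholder
    Classifiers=["csv-double-quote"],
)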

For any other questions specific to a job/crawler in your account, please reach out to AWS Premium Support.

AWS
SUPPORT ENGINEER
Manu_G
answered 2 years ago
  • Thank you for your answer. In my case I don't read the file directly with from_options; I just create a crawler pointing to the S3 path, and all files have more than 2 columns. To load the data, I create a new job and choose the crawled source table and the crawled destination table.
