Backslash in CSV with Glue


Hi team, I have an AWS Glue job that reads data from a CSV file in S3 and inserts the data into a table in a MySQL RDS Aurora DB. The escape character used in the CSV file is the backslash (\).

I have 2 issues with Glue while loading the CSV file:

1 - Let's say I have the string "Hello\John". This will be imported as "HelloJohn", since the backslash is defined as the escape character, so data integrity is not preserved.

2 - There are some lines in the CSV file where the backslash is **right before a closing double quote**. For example:

"aaa\blah blah blah\blah blah".

This one breaks the parser completely (the crawler doesn't detect the file's columns and their data types at all; the result after crawling is blank) because the backslash effectively escapes the closing double quote.

Is there a way to configure Glue to deal with backslashes and avoid the above 2 issues? (I already defined escapeChar = \ on the crawler.)

1 Answer
1. Tried reading the mentioned sample data from a CSV file using a Glue DynamicFrame and a Spark DataFrame. However, in both cases, I observed that the special character was preserved.
# Read the CSV with a Spark DataFrame; the backslash in "Hello\John" is preserved
df = spark.read.option("header", "true").csv("<s3-path-to-file>")
df.show()
+----------+
|      colA|
+----------+
|Hello\John|
+----------+
# Read the same file as a Glue DynamicFrame; the backslash is preserved here as well
datasource0 = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["<s3-path-to-file>"]}, format="csv", transformation_ctx="datasource0")
datasource0.toDF().show()
+----------+
|      col0|
+----------+
|Hello\John|
+----------+

So, please make sure your code is not doing any other processing before writing the data to MySQL. If you want to control how the escape character is handled when reading, see the sketch below.
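As a minimal sketch, assuming you want backslashes kept as literal data: you can set the quote/escape options explicitly when reading. The S3 path is a placeholder, and the option names should be verified against your Glue/Spark version.

# Spark DataFrame: use the quote character itself as the escape character,
# so a backslash inside a field is never interpreted as an escape
df = spark.read \
    .option("header", "true") \
    .option("quote", '"') \
    .option("escape", '"') \
    .csv("<s3-path-to-file>")

# Glue DynamicFrame: the CSV "escaper" format option sets the escape character;
# leaving it unset (the default) keeps backslashes as ordinary characters
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["<s3-path-to-file>"]},
    format="csv",
    format_options={"withHeader": True, "quoteChar": '"'},
    transformation_ctx="datasource0",
)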

2. Created a CSV file with the mentioned sample data and ran a crawler with a custom CSV classifier (quote symbol: "). The crawler was able to create the table and I was able to query the data from Athena (see the classifier sketch below this item).
SELECT * FROM "default"."csv" limit 10;
col0          col1          col2          col3
-------------------------------------------------
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\

Because you mentioned the crawler produced a blank table, make sure your CSV file has at least two columns and two rows of data, as per the documentation, for the crawler to be able to determine a table.
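If it helps to reproduce that setup programmatically, here is a minimal boto3 sketch; the classifier, crawler, database, role, and path names are placeholders, not values from this thread.

import boto3

glue = boto3.client("glue")

# Custom CSV classifier with a double quote as the quote symbol
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-doublequote-classifier",   # hypothetical name
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

# Attach the classifier so the crawler tries it before the built-in CSV classifier
glue.create_crawler(
    Name="csv-backslash-crawler",               # hypothetical name
    Role="<glue-service-role-arn>",             # placeholder
    DatabaseName="default",
    Classifiers=["csv-doublequote-classifier"],
    Targets={"S3Targets": [{"Path": "<s3-path-to-folder>"}]},
)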

For any other questions specific to a job/crawler in your account, please reach out to AWS Premium Support.

AWS
Support Engineer
Manu_G
answered 2 years ago
Expert
reviewed 8 days ago
  • Thank you for your answer. In my case I don't read the file directly with from_options; I just create a crawler pointing to the S3 path (all files have more than 2 columns), and to load the data I create a new job and choose the source crawled table and the destination crawled table.
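For that catalog-based flow, a job script typically reduces to something like the following minimal sketch; the database and table names are placeholders, and the escape handling still depends on how the source table and classifier were defined.

# Read from the crawled source table and write to the crawled Aurora MySQL table
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",            # placeholder
    table_name="source_csv_table",     # placeholder
    transformation_ctx="source",
)

glueContext.write_dynamic_frame.from_catalog(
    frame=source,
    database="my_database",            # placeholder
    table_name="target_mysql_table",   # placeholder
    transformation_ctx="sink",
)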
