Backslash in CSV with Glue


Hi team, I have an AWS Glue job that reads data from a CSV file in S3 and inserts the data into a table in a MySQL RDS Aurora DB. The escape character used in the CSV file is the backslash (\).

I have 2 issues with Glue while loading the CSV file:

1 - Let's say I have the string "Hello\John". This will be imported as "HelloJohn", since the backslash is defined as the escape character, so data integrity is not preserved.

2 - There are some lines in the CSV file where the backslash is **right before a closing double quote**. For example:

"aaa\blah blah blah\blah blah".

This one breaks the parser completely (the crawler doesn't detect the file's columns and their data types at all; the result after crawling is blank) because the backslash effectively escapes the closing double quote.

Is there a way to configure Glue to deal with backslashes and avoid the above 2 issues? (I already defined escapeChar = \ on the crawler.)

1 Answer
1. Tried reading the mentioned sample data from a CSV file using a Glue DynamicFrame and a Spark DataFrame. However, in both cases, I observed that the special character was preserved.
# Read the CSV with a Spark DataFrame; the backslash in "Hello\John" is preserved
df = spark.read.option("header", "true").csv("<s3-path-to-file>")
df.show()
+----------+
|      colA|
+----------+
|Hello\John|
+----------+
# Read the same file as a Glue DynamicFrame; the backslash is preserved here as well
datasource0 = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["<s3-path-to-file>"]}, format="csv", transformation_ctx="datasource0")
datasource0.toDF().show()
+----------+
|      col0|
+----------+
|Hello\John|
+----------+

So, please make sure your code is not doing any other processing before writing the data to MySQL. If you want to control how the escape character is handled when reading, see the sketch below.
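As a minimal sketch, assuming you want backslashes kept as literal data: you can set the quote/escape options explicitly when reading. The S3 path is a placeholder, and the option names should be verified against your Glue/Spark version.

# Spark DataFrame: use the quote character itself as the escape character,
# so a backslash inside a field is never interpreted as an escape
df = spark.read \
    .option("header", "true") \
    .option("quote", '"') \
    .option("escape", '"') \
    .csv("<s3-path-to-file>")

# Glue DynamicFrame: the CSV "escaper" format option sets the escape character;
# leaving it unset (the default) keeps backslashes as ordinary characters
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["<s3-path-to-file>"]},
    format="csv",
    format_options={"withHeader": True, "quoteChar": '"'},
    transformation_ctx="datasource0",
)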

2. Created a CSV file with the mentioned sample data and ran a crawler with a custom CSV classifier (quote symbol: "). The crawler was able to create the table and I was able to query the data from Athena (see the classifier sketch below this item).
SELECT * FROM "default"."csv" limit 10;
col0          col1          col2          col3
-------------------------------------------------
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\
aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\	aaa\blah blah blah\blah blah\

Because you mentioned the crawler produced a blank table, make sure your CSV file has at least two columns and two rows of data, as per the documentation, for the crawler to be able to determine a table.
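If it helps to reproduce that setup programmatically, here is a minimal boto3 sketch; the classifier, crawler, database, role, and path names are placeholders, not values from this thread.

import boto3

glue = boto3.client("glue")

# Custom CSV classifier with a double quote as the quote symbol
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-doublequote-classifier",   # hypothetical name
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

# Attach the classifier so the crawler tries it before the built-in CSV classifier
glue.create_crawler(
    Name="csv-backslash-crawler",               # hypothetical name
    Role="<glue-service-role-arn>",             # placeholder
    DatabaseName="default",
    Classifiers=["csv-doublequote-classifier"],
    Targets={"S3Targets": [{"Path": "<s3-path-to-folder>"}]},
)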

For any other questions specific to a job/crawler in your account, please reach out to AWS Premium Support.

AWS
Support Engineer
Manu_G
answered 2 years ago
Expert
reviewed 8 days ago
  • Thank you for your answer. In my case I don't read the file directly with from_options; I just create a crawler pointing to the S3 path (all files have more than 2 columns), and to load the data I create a new job and choose the source crawled table and the destination crawled table.
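For that catalog-based flow, a job script typically reduces to something like the following minimal sketch; the database and table names are placeholders, and the escape handling still depends on how the source table and classifier were defined.

# Read from the crawled source table and write to the crawled Aurora MySQL table
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",            # placeholder
    table_name="source_csv_table",     # placeholder
    transformation_ctx="source",
)

glueContext.write_dynamic_frame.from_catalog(
    frame=source,
    database="my_database",            # placeholder
    table_name="target_mysql_table",   # placeholder
    transformation_ctx="sink",
)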
