AWS Glue Jobs 3.0 Unable to parse CSV file


Hello, I am running a job that applies an ETL to a semicolon-separated CSV file on S3. However, when I read the file as an AWS Glue DynamicFrame and call any method such as printSchema or toDF, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o77.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (52bff5da55da executor driver): com.amazonaws.services.glue.util.FatalException: Unable to parse file: s3://my-bucket/my-file.csv

I have already verified the encoding: it is UTF-8, so there should be no problem. When I read the CSV using spark.read.csv, it works fine, and the crawlers can also recognize the schema. The data contains some special characters that shouldn't be there, and removing them is part of the ETL I am looking to perform. Neither from_catalog nor from_options from AWS Glue works, and the problem is the same whether I run the job locally in Docker or in Glue Studio. My data is partitioned by date folders, so I would prefer not to read it directly with Spark and to take advantage of the Glue Data Catalog as well.

Thanks in advance.

  • Can you share the full stack trace? There should be a "Caused by" part giving more information.

1 Answer
Accepted Answer

Hello,

The error "Fatal exception com.amazonaws.services.glue.readers unable to parse file *.csv" occurs when the CSV file is either not UTF-8 encoded or contains byte sequences that are not valid UTF-8.

Since you have already verified that your CSV file is UTF-8 encoded and mentioned that the data contains some special characters, please try running the ETL job after removing these characters from the file.
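One way to find and remove such characters before the file reaches Glue is sketched below in plain Python. This is only an illustration, assuming the offending characters are byte sequences that are not valid UTF-8; the helper names (find_invalid_utf8, strip_invalid_utf8, clean_file) and the file paths are hypothetical and not part of any AWS API.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes):
    """Return (byte_offset, offending_bytes) for each undecodable sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # If the remainder decodes cleanly, there is nothing left to report.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end
    return problems


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop every byte sequence that is not valid UTF-8."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_file(src: str, dst: str) -> int:
    """Write a cleaned copy of src to dst; return how many sequences were dropped."""
    raw = Path(src).read_bytes()
    Path(dst).write_bytes(strip_invalid_utf8(raw))
    return len(find_invalid_utf8(raw))
```

For example, clean_file("my-file.csv", "my-file-clean.csv") on a local copy writes a cleaned file that can then be uploaded to S3. Using errors="replace" instead of errors="ignore" would keep a visible U+FFFD marker where bytes were dropped, which can make the affected rows easier to review.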

I tested this in my personal AWS account with a semicolon-separated CSV file and was able to run a Glue ETL job successfully using the from_catalog function.
Steps I followed for this test:
1. Created a sample semicolon-separated CSV file.
2. Created the table from this file using a Glue crawler.
3. Created a Glue ETL job to read this table using the from_catalog function.
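
The last step could look roughly like the sketch below. It assumes the crawler has already created the table; the database and table names ("my_database", "my_csv_table") and the helper read_catalog_table are hypothetical placeholders, not the names used in the test above.

```python
def read_catalog_table(glue_context, database, table_name):
    """Read a crawler-created Data Catalog table as a DynamicFrame."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )


def main():
    # Imported here because these modules are only available inside a Glue
    # job or the Glue Docker image.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    # Hypothetical names; replace with the database and table the crawler created.
    dyf = read_catalog_table(glue_context, "my_database", "my_csv_table")
    dyf.printSchema()
```

In a job script, call main() at the end of the file. Reading through the catalog this way reuses the delimiter and partition information the crawler stored with the table, so the job script does not have to repeat it.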
AWS Support Engineer, answered a year ago
Reviewed by an Expert 6 months ago and again 3 months ago
  • Thank you so much. When I converted the file to CSV UTF-8, it appears the encoding got corrupted, so the problem persisted, but removing some characters before the conversion worked.
