Schema conversion error with Glue ETL


Hello,

I am experiencing an issue when trying to use Glue ETL on one of the tables in my Data Catalog. I am using the visual tool with a very simple SQL transformation on the table, and when I click on the Data Preview I see the following error, which appears to be a problem with the schema of the campaignid field in the Parquet files in the bucket. The field should be a string, and the Output Schema in the Glue ETL shows that the field is a string, but it appears to be expecting an int for some reason.

Error from Data Preview: Unable to retrieve data preview. Py4JJavaError: An error occurred while calling o86.getSampleDynamicFrame. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (172.34.231.216 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://reports.kanoapps.net/Reports/Data/Athena/parquet/raw_active/raw.active2022-02-25. Column: [campaignid], Expected: int, Found: BINARY at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:706) at
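For context, the job was built in the visual editor, so the script is generated; stripped down, it does roughly the following (the database, table, and bucket names below are placeholders, and the SQL is simplified):

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source table from the Data Catalog (placeholder names)
    source = glueContext.create_dynamic_frame.from_catalog(
        database="reports_db",
        table_name="raw_active",
        transformation_ctx="source",
    )

    # The "very simple SQL transformation" from the visual editor (simplified here)
    source.toDF().createOrReplaceTempView("raw_active")
    result = DynamicFrame.fromDF(spark.sql("SELECT * FROM raw_active"), glueContext, "result")

    # Write the result out to S3 as Parquet (placeholder path)
    glueContext.write_dynamic_frame.from_options(
        frame=result,
        connection_type="s3",
        connection_options={"path": "s3://example-output-bucket/output/"},
        format="parquet",
    )

    job.commit()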

I have tried removing this file from the bucket to see if the preview works without it, but I get the same error for many other files too.

When I try just running the job, I get the following error in Spark after the ETL job fails:

An error occurred while calling o106.pyWriteDynamicFrame. org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

I then tried disabling the vectorized Parquet reader to get Spark to read the Parquet files differently. I did this by adding the line spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false") to the script where the args are set. After doing this my ETL job still fails, and I get the following error: An error occurred while calling o107.pyWriteDynamicFrame. java.lang.ClassCastException
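Concretely, this is where I put the setting (a sketch of the same placeholder script as above):

    spark = glueContext.spark_session
    # Fall back to Spark's row-based Parquet reader instead of the vectorized one
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)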

The last thing we attempted was to include a line to ignore any corrupt files, though I don't think my files are corrupt, just that the schema is being misinterpreted by Spark. We added this line: spark.sql("SET spark.sql.files.ignoreCorruptFiles=true"). This allowed my ETL job to run without an error and it created a new table in the Data Catalog after the job completed; however, the output S3 bucket is completely empty. It just didn't write any data.
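That line went in the same place in the script (again a sketch, same placeholders):

    spark = glueContext.spark_session
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    # Skip files Spark decides it cannot read instead of failing the whole stage
    spark.sql("SET spark.sql.files.ignoreCorruptFiles=true")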

Has anyone experienced this schema misinterpretation before, or does anyone know how to handle any of these errors? To me it seems like the data type is correct (we want it to be a string) and it is just being mistaken for an int by Spark.

Jjo