schema conversion error with Glue ETL

Hello,

I am experiencing an issue when trying to use Glue ETL on one of the tables in my Data Catalog. I am using the visual tool with a very simple SQL transformation on the table, and when I click on Data Preview I see the following error, which appears to be a problem with the schema of the campaignid field in one of the parquet files in the bucket. The field should be a string, and the Output Schema in the Glue ETL editor shows the field as a string, but Spark appears to be expecting an int for some reason.

Error from Data Preview: Unable to retrieve data preview. Py4JJavaError: An error occurred while calling o86.getSampleDynamicFrame. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (172.34.231.216 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://reports.kanoapps.net/Reports/Data/Athena/parquet/raw_active/raw.active2022-02-25. Column: [campaignid], Expected: int, Found: BINARY at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:706) at
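For reference, the generated job script boils down to something like the sketch below (the database, table, and output path are placeholders for our real names, and the actual SQL is slightly longer but just as simple):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Data Catalog table that sits on top of the parquet files in S3
# ("my_database" / "raw_active" stand in for our real names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_active",
    transformation_ctx="source",
)

# The "very simple SQL transformation" from the visual editor
source.toDF().createOrReplaceTempView("raw_active")
transformed = DynamicFrame.fromDF(
    spark.sql("SELECT * FROM raw_active"), glueContext, "transformed"
)

# Sink: write the result back to S3 as parquet (path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/"},
    format="parquet",
)
job.commit()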

I tried removing that file from the bucket to see if the preview works without it, but I get the same error for many other files as well.

When I just run the job, the ETL job fails and I get the following error from Spark:

An error occurred while calling o106.pyWriteDynamicFrame. org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

I then tried disabling the vectorized reader so the parquet files would be read through a different code path. I did this by adding the line spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false") to the script where the args are set (see the snippet below for exactly where I put it). After doing this the ETL job still fails, with the following error: An error occurred while calling o107.pyWriteDynamicFrame. java.lang.ClassCastException
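Concretely, the setting was added near the top of the script, right after the Spark session is created (placing it there is my own guess at the right spot):

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Attempt 1: fall back to the non-vectorized parquet reader
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")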

The last thing we attempted was to add a line to ignore any corrupt files, though I don't think my files are actually corrupt, just that the schema is being misinterpreted by Spark. We added this line: spark.sql("SET spark.sql.files.ignoreCorruptFiles=true") (placed next to the reader setting, as shown below). With this, the ETL job ran without an error and created a new table in the Data Catalog after the job completed, however the output S3 bucket is completely empty. It just didn't move any data.
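This was added right next to the reader setting from the previous attempt:

# Attempt 2: skip files Spark refuses to read instead of failing the job
spark.sql("SET spark.sql.files.ignoreCorruptFiles=true")
# (as I understand it, this should be equivalent to:)
# spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")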

Has anyone experienced this schema misinterpretation before, or does anyone know how to handle any of these errors? To me, it seems like the data type is correct (we want it to be a string) and it is just being misread as an int by Spark.

Jjo