How can I fix this SchemaColumnConvertNotSupportedException error?


Hello everyone, I just started using Glue, so forgive me if the question is stupid or if I'm not providing the right information. I've been facing this issue for the past two days and cannot seem to solve it.

I'm running a Glue job where I read a table from the Glue Catalog as a DynamicFrame and then convert it into a Spark DataFrame to create some views and preprocess the data the way I want. Every time I try to write my results to S3, convert the final DataFrame back to a Glue DynamicFrame, or even just run a df.show(), I get the error org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException.

I tried to dissect the query to find the mistake, and found that even if I just load the data from the Glue Catalog (S3 data registered by a Crawler), turn it into a Spark DataFrame, create a temp view, run a trivial query (SELECT * FROM tempview), and try to write the result to S3, I still get this error. In the error logs I find something like this:

org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://bucket_name/folder_name/partition1=x/partition2=y/file.parquet. Column: [ColumnZ], Expected: string, Found: INT32

I really don't know how to fix this kind of error, especially since I get it even from the simple operations described above. If someone could help me I would really appreciate it; I'm getting desperate.

asked 6 months ago · 777 views
2 Answers

That means the schema of your files is inconsistent: the column has been generalized as string in the catalog, but the underlying files don't all match, which is a problem in itself.
Assuming you can't fix the parquet files to be consistent (or the table is partitioned and the files are consistent within each partition), you might still be able to work around it.
Looking at the error, I would say you are reading as a DataFrame rather than a DynamicFrame, which is more flexible in these respects.
Can you share the reading part of the code and the full stack trace?

AWS
EXPERT
answered 6 months ago

Has this problem been resolved? If so, what was the solution? Is it possible to rectify the data type using AWS Glue Spark? How do we handle situations where data types vary across multiple files, particularly in parquet format?
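One hedged note on the follow-up: SchemaColumnConvertNotSupportedException is raised by Spark's vectorized parquet reader, so when the files cannot be made consistent, disabling that reader sometimes avoids the exception (at a read-performance cost) by falling back to the row-based reader. A configuration sketch for the Spark session, untested against this particular table:

```python
# Configuration sketch (assumes an existing SparkSession named `spark`):
# turn off the vectorized parquet reader, the component that raises
# SchemaColumnConvertNotSupportedException on per-file type mismatches.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```

Whether this helps depends on the specific type conflict; the more robust fixes remain making the files consistent or resolving the choice on the DynamicFrame side.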

answered 4 months ago
