AWS Glue Catalog is not fetching the new fields added to the Parquet files through a PySpark script


Hi, we have a scenario with multiple Parquet files in S3, where we added new fields to some of the files. After that we ran the crawler, and the Glue Catalog table shows the newly added fields. However, when we read the Parquet files in a Glue job using a PySpark script with glueContext.create_dynamic_frame.from_catalog, the new fields do not show up in the schema at all. We tried mergeSchema in the additional options, but that is not working either. We expected Glue to read from the catalog, since it is updated, and show null values for the Parquet files that do not have the new fields. Please help on how to get the new fields in the Glue job through a catalog read.

Asked 2 years ago · 253 views
1 Answer

Hello,

As you have mentioned, this is expected behaviour with Parquet files. Glue and Apache Spark use the same Parquet readers, and since each Parquet file carries its own schema, a Glue DynamicFrame defines the schema of a Parquet table by picking the schema of the first file in the S3 location; it does not scan through all the files. If the files are stored in multiple folders, it picks the first file from the first folder.
Generally, using mergeSchema in your read should resolve this issue. Can you please verify that the syntax is correct:

======
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="tsetdb",
    table_name="testtable",
    transformation_ctx="datasource0",
    additional_options={"mergeSchema": "true"},
)
======

Also try using Spark SQL and check whether that gives the expected result. You can enable schema merging for SQL queries in Glue by adding this line to your script:


======
glueContext.sql("set spark.sql.parquet.mergeSchema=true")
======

I hope this helps. If not, you could also try renaming the file that has the new schema so that it appears first in the S3 listing. Not a very good approach, but it can be tried.

AWS
Support Engineer
Answered 2 years ago
