AWS Glue job is not picking up new fields added to Parquet files when reading through the Glue Catalog in a PySpark script

Hi, we have multiple Parquet files in S3, and we added new fields to some of them. After that we ran the crawler, and the Glue Catalog table now shows the newly added fields. However, when we read the Parquet files in a Glue job using a PySpark script with glueContext.create_dynamic_frame.from_catalog, the new fields do not show up in the schema at all. We tried mergeSchema in additional_options, but that did not work either. We expect Glue to read the schema from the updated catalog and to show null values for the Parquet files that do not contain the new fields. Please advise how to get the new fields in the Glue job when reading through the catalog.

Asked 2 years ago, 251 views
1 Answer

Hello,

What you describe is expected behaviour with Parquet files. Glue and Apache Spark use the same Parquet readers. Each Parquet file carries its own schema, and when a Glue DynamicFrame defines the schema for a Parquet table, it takes the schema from the first file in the S3 location. It does not scan all the files to build the schema. If the files are stored in multiple folders, it picks the first file from the first folder.
In general, using mergeSchema in your read should resolve this kind of issue. Can you please verify that your syntax matches the following:
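
To make the difference concrete, here is a minimal conceptual sketch in plain Python. It is not how Glue or Spark are implemented; plain dicts stand in for per-file Parquet schemas, and the field names (`id`, `name`, `email`) and helper functions are hypothetical. It only contrasts "schema from the first file" with "schema merged across all files", where rows from older files get null for the new field.

```python
# Conceptual model only: plain dicts stand in for Parquet file schemas.
# This is NOT Glue's or Spark's actual implementation.

def first_file_schema(file_schemas):
    """Schema taken from the first file only (the behaviour described above)."""
    return dict(file_schemas[0])


def merged_schema(file_schemas):
    """Union of all fields across files, keeping first-seen order."""
    merged = {}
    for schema in file_schemas:
        for name, dtype in schema.items():
            merged.setdefault(name, dtype)
    return merged


def project(row, schema):
    """Fit a row to a schema; fields the row lacks become None (null)."""
    return {name: row.get(name) for name in schema}


# Hypothetical example: the second file gained a new field, "email".
old_file = {"id": "int", "name": "string"}
new_file = {"id": "int", "name": "string", "email": "string"}
schemas = [old_file, new_file]

narrow = first_file_schema(schemas)   # new field is missing
wide = merged_schema(schemas)         # new field is present
old_row = {"id": 1, "name": "a"}      # a row from the older file

print(list(narrow))
print(list(wide))
print(project(old_row, wide))
```

With the merged schema, the row from the older file keeps its existing values and carries a null for `email`, which is the result the question is asking for.
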

======
# Read through the Data Catalog and ask the Parquet reader to merge per-file schemas.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="testdb", table_name="testtable", transformation_ctx="datasource0", additional_options={"mergeSchema": "true"})
======

Also try Spark SQL and check whether it gives the expected result. You can enable schema merging for SQL queries in Glue by adding this line to your script:

======
glueContext.sql("set spark.sql.parquet.mergeSchema=true")
======
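
As another way to check whether schema merging itself works on your files, you could bypass the catalog read and use the Spark DataFrame reader directly. The sketch below is an assumption-laden example rather than a confirmed fix: the helper name `read_parquet_merged` and the S3 path are hypothetical, and `spark` is expected to be the job's existing SparkSession (for example `glueContext.spark_session`).

```python
def read_parquet_merged(spark, path):
    """Read Parquet files under `path` with per-file schemas merged.

    `spark` is an existing SparkSession (in a Glue job, typically
    glueContext.spark_session). `path` is a hypothetical S3 prefix.
    """
    # mergeSchema is the Spark Parquet reader option that unions the
    # schemas of all files instead of trusting a single file.
    return spark.read.option("mergeSchema", "true").parquet(path)


# Usage inside a Glue job (path is a placeholder, not from the question):
# df = read_parquet_merged(glueContext.spark_session, "s3://your-bucket/your-prefix/")
# df.printSchema()
# If a DynamicFrame is needed afterwards:
# from awsglue.dynamicframe import DynamicFrame
# dyf = DynamicFrame.fromDF(df, glueContext, "merged")
```

If `printSchema()` shows the new fields here but the catalog read does not, that suggests the problem lies in how the DynamicFrame infers the schema rather than in the files themselves.
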

I hope this helps. If not, you can also try renaming the file that has the new schema so that it appears first in the S3 listing. This is not a great approach, but it may be worth a try.
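
For the renaming workaround, it may help to check which object key would come first before renaming anything. The sketch below assumes that "top of the listing" means the order in which S3 returns keys, which is ascending UTF-8 byte order of the key names; the keys and the helper `first_listed_key` are hypothetical.

```python
def first_listed_key(keys):
    """Return the key that sorts first in UTF-8 byte order."""
    return min(keys, key=lambda k: k.encode("utf-8"))


# Hypothetical keys: the file with the new schema currently sorts last.
current = ["data/part-0001.parquet", "data/part-0002-new.parquet"]

# Candidate new name for the new-schema file, chosen to sort first.
renamed = ["data/part-0001.parquet", "data/0000-new.parquet"]

print(first_listed_key(current))
print(first_listed_key(renamed))
```
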

AWS
Support Engineer
Answered 2 years ago
