Schema inconsistency between Glue Data Catalog and Glue ETL Job

0

I have setup an AWS Glue Crawler to read the AWS CUR data residing in S3. Yesterday, I have enabled new Cost Allocation tags in CUR and today I can see them when I query the table in Athena. But I cant access the new columns in AWS Glue ETL job. I am reading the table in AWS Glue ETL as below.

dyf = glueContext.create_dynamic_frame.from_catalog(database=source_db,
                                                       table_name=source_tbl)
usage_df = dyf.toDF()
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema() ## Schema is not showing the new fields

Tried executing MSCK REPAIR TABLE, still no luck. The Crawler property set as Update the table definition in the data catalog and its a partitioned table with year and month as partition column. Am I missing anything ?

已提問 5 個月前檢視次數 561 次
1 個回答
2
已接受的答案

DynamicFrame doesn't use the catalog, it will infer the schema from the actual data files.
DataFrame does and since you are converting to it, you can just do:

usage_df = spark.table("source_db", "source_tbl")
profile pictureAWS
專家
已回答 5 個月前
profile pictureAWS
專家
已審閱 5 個月前
  • Thanks a lot. It worked. usage_df = spark.table("source_db.source_tbl")

  • @Gonzalo Herreros can you share details around fixing it and then going back to using create_dynamic_frame.from_catalog or create_data_frame.from_catalog. Or is the expectation to use only spark.table once we have updated the schema?

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南