Schema inconsistency between Glue Data Catalog and Glue ETL Job


I have set up an AWS Glue Crawler to read the AWS CUR data residing in S3. Yesterday I enabled new cost allocation tags in CUR, and today I can see them when I query the table in Athena. However, I can't access the new columns in my AWS Glue ETL job. I am reading the table in the job as follows:

dyf = glueContext.create_dynamic_frame.from_catalog(database=source_db,
                                                       table_name=source_tbl)
usage_df = dyf.toDF()
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema() ## Schema is not showing the new fields

I tried running MSCK REPAIR TABLE, but still no luck. The crawler is configured with "Update the table definition in the data catalog", and the table is partitioned, with year and month as the partition columns. Am I missing anything?

Asked 5 months ago, 561 views
1 Answer
Accepted Answer

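A DynamicFrame does not take its schema from the Data Catalog; it infers the schema from the actual data files.
A DataFrame read through the catalog does use the catalog schema, and since you are converting to a DataFrame anyway, you can read the table directly: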

usage_df = spark.table("source_db.source_tbl")
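
The approach above can be sketched with a few small helpers. This is only an illustration: the function names (qualified_name, read_with_catalog_schema, to_dynamic_frame) and the default frame name are hypothetical, and it assumes the job runs with the Glue Data Catalog enabled as the Spark metastore (for example through the --enable-glue-datacatalog job parameter) so that spark.table can resolve the table. Note that spark.table takes a single "database.table" string. If later steps need a DynamicFrame again, DynamicFrame.fromDF can wrap the DataFrame.

```python
def qualified_name(database: str, table: str) -> str:
    """Build the single 'database.table' string that spark.table() expects."""
    return f"{database}.{table}"


def read_with_catalog_schema(spark, database, table, filter_clause=None):
    """Read a catalog table as a Spark DataFrame so the schema comes from the catalog.

    `spark` is the job's SparkSession (e.g. glueContext.spark_session).
    """
    df = spark.table(qualified_name(database, table))
    if filter_clause is not None:
        # Same filtering step as in the question, applied after the read.
        df = df.filter(filter_clause)
    return df


def to_dynamic_frame(df, glue_context, name="usage_dyf"):
    """Optionally wrap the DataFrame in a DynamicFrame for Glue transforms or sinks."""
    # Imported lazily because awsglue is only available inside the Glue job runtime.
    from awsglue.dynamicframe import DynamicFrame

    return DynamicFrame.fromDF(df, glue_context, name)
```

Inside the job this would be used roughly as usage_df = read_with_catalog_schema(spark, source_db, source_tbl, filter_clause), followed by usage_df.printSchema() to check that the new tag columns appear.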
Answered 5 months ago by an AWS Expert
Reviewed 5 months ago by an AWS Expert
  • Thanks a lot. It worked. usage_df = spark.table("source_db.source_tbl")

  • @Gonzalo Herreros, can you share details on fixing it and then going back to using create_dynamic_frame.from_catalog or create_data_frame.from_catalog? Or is the expectation to use only spark.table once the schema has been updated?
