Schema inconsistency between Glue Data Catalog and Glue ETL Job

I have set up an AWS Glue Crawler to read the AWS CUR (Cost and Usage Report) data residing in S3. Yesterday I enabled new cost allocation tags in CUR, and today I can see them when I query the table in Athena. However, I can't access the new columns in my AWS Glue ETL job. I am reading the table in the job as shown below.

dyf = glueContext.create_dynamic_frame.from_catalog(
    database=source_db,
    table_name=source_tbl,
)
usage_df = dyf.toDF()
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema() ## Schema is not showing the new fields

I tried running MSCK REPAIR TABLE, but the new columns still do not appear. The crawler's schema-change option is set to "Update the table definition in the data catalog", and the table is partitioned, with year and month as the partition columns. Am I missing anything?
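
To pin down exactly which columns the job is not seeing, one option is to diff the catalog's column list against the columns of the frame the job actually loaded. The sketch below is only an illustration: the helper names `catalog_column_names` and `missing_from_frame` are made up here, and it assumes the input is the dictionary returned by the boto3 Glue client's `get_table` call.

```python
def catalog_column_names(get_table_response):
    """Return data and partition column names from a Glue get_table response."""
    table = get_table_response["Table"]
    data_cols = table.get("StorageDescriptor", {}).get("Columns", [])
    part_cols = table.get("PartitionKeys", [])
    return [col["Name"] for col in data_cols + part_cols]


def missing_from_frame(catalog_cols, frame_cols):
    """List catalog columns that are absent from the frame (case-insensitive)."""
    present = {name.lower() for name in frame_cols}
    return [name for name in catalog_cols if name.lower() not in present]


# Possible use inside the job (needs boto3 and Glue read permissions):
#   import boto3
#   resp = boto3.client("glue").get_table(DatabaseName=source_db, Name=source_tbl)
#   print(missing_from_frame(catalog_column_names(resp), usage_df.columns))
```

If the list printed is non-empty, the catalog knows about columns that the loaded frame does not contain.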

asked 5 months ago · 526 views
1 Answer
Accepted Answer

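A DynamicFrame does not take its schema from the catalog; it infers the schema from the actual data files.
A DataFrame read through the catalog does use the table definition, and since you are converting to a DataFrame anyway, you can read the table directly: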

usage_df = spark.table("source_db.source_tbl")
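
A slightly fuller sketch of the same idea follows, in case the rest of the job still needs a DynamicFrame afterwards. The helper names `read_catalog_table` and `to_dynamic_frame` are hypothetical, and the sketch assumes the job's Spark session is configured to use the Glue Data Catalog as its metastore (for example via the `--enable-glue-datacatalog` job parameter).

```python
def read_catalog_table(spark, database, table):
    """Read a table through the Spark catalog so the schema comes from the table definition."""
    # spark.table takes a single, optionally database-qualified, table name.
    return spark.table(f"{database}.{table}")


def to_dynamic_frame(df, glue_context, name="usage"):
    """Wrap a Spark DataFrame as a Glue DynamicFrame for Glue-specific transforms or sinks."""
    # Imported lazily because awsglue is only available inside the Glue runtime.
    from awsglue.dynamicframe import DynamicFrame
    return DynamicFrame.fromDF(df, glue_context, name)


# Possible use inside the Glue job, reusing the names from the question:
#   usage_df = read_catalog_table(spark, source_db, source_tbl).filter(filter_clause)
#   usage_df.printSchema()
#   usage_dyf = to_dynamic_frame(usage_df, glueContext)
```

Converting back with `DynamicFrame.fromDF` keeps the DataFrame's schema, so any columns picked up from the catalog definition carry over to the DynamicFrame.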
AWS EXPERT · answered 5 months ago
AWS EXPERT · reviewed 5 months ago
  • Thanks a lot, it worked with: usage_df = spark.table("source_db.source_tbl")

  • @Gonzalo Herreros, can you share details on fixing this and then going back to create_dynamic_frame.from_catalog or create_data_frame.from_catalog? Or is the expectation to use only spark.table once the schema has been updated?
