Schema inconsistency between Glue Data Catalog and Glue ETL Job

I have set up an AWS Glue Crawler to read the AWS CUR (Cost and Usage Report) data residing in S3. Yesterday I enabled new cost allocation tags in the CUR, and today I can see them when I query the table in Athena. However, I can't access the new columns in my AWS Glue ETL job. I am reading the table in the job as shown below.

dyf = glueContext.create_dynamic_frame.from_catalog(database=source_db,
                                                    table_name=source_tbl)
usage_df = dyf.toDF()
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema()  # The schema does not show the new fields

I tried running MSCK REPAIR TABLE, but still no luck. The crawler is configured with "Update the table definition in the data catalog", and the table is partitioned with year and month as partition columns. Am I missing anything?

asked 5 months ago, 561 views
1 Answer
Accepted Answer

A DynamicFrame doesn't take its schema from the catalog; it infers the schema from the actual data files.
A DataFrame does use the catalog schema, and since you are converting to a DataFrame anyway, you can just do:

usage_df = spark.table("source_db.source_tbl")
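
As a minimal sketch of the suggestion above (not part of the original answer): the helper below wraps spark.table, which takes a single "database.table" string, and optionally checks that the columns you expect are present in the resulting schema. The function names and the expected_columns parameter are hypothetical. It assumes that spark is the job's SparkSession (for example glueContext.spark_session) and that the job is configured to use the Glue Data Catalog as its metastore (the --enable-glue-datacatalog job parameter).

```python
def qualified_name(database, table):
    """Build the single 'database.table' string that spark.table() expects."""
    return f"{database}.{table}"


def read_catalog_table(spark, database, table, expected_columns=()):
    """Read a table through the Spark catalog.

    Assumption: `spark` is the Glue job's SparkSession and the job uses the
    Glue Data Catalog as its metastore, so the schema comes from the catalog
    table definition rather than being inferred from the data files.

    `expected_columns` is a hypothetical, optional list of column names
    (for example newly enabled cost allocation tag columns) to verify.
    """
    df = spark.table(qualified_name(database, table))

    # Fail early if a column we rely on is not in the schema we got back.
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns missing from table schema: {missing}")
    return df
```

In a Glue job this might be called as usage_df = read_catalog_table(glueContext.spark_session, source_db, source_tbl), followed by the same filter as before.
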
AWS EXPERT, answered 5 months ago
AWS EXPERT, reviewed 5 months ago
  • Thanks a lot. It worked. usage_df = spark.table("source_db.source_tbl")

  • @Gonzalo Herreros, can you share details on fixing it and then going back to using create_dynamic_frame.from_catalog or create_data_frame.from_catalog? Or is the expectation to use only spark.table once we have updated the schema?
