Schema inconsistency between Glue Data Catalog and Glue ETL Job


I have set up an AWS Glue Crawler to read the AWS CUR data residing in S3. Yesterday I enabled new Cost Allocation tags in CUR, and today I can see them when I query the table in Athena. However, I can't access the new columns in my AWS Glue ETL job. I am reading the table in the Glue ETL job as follows:

dyf = glueContext.create_dynamic_frame.from_catalog(database=source_db,
                                                    table_name=source_tbl)
usage_df = dyf.toDF()
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema()  # schema is not showing the new fields

I tried executing MSCK REPAIR TABLE, still no luck. The Crawler property is set to "Update the table definition in the data catalog", and it is a partitioned table with year and month as partition columns. Am I missing anything?
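For context, one way to confirm whether the Data Catalog entry itself has picked up the new columns is to read the table definition directly with boto3 (a sketch; `source_db` and `source_tbl` are the placeholder names from the post, and this requires AWS credentials with Glue read access):

```python
import boto3

# Sketch: inspect the Glue Data Catalog table definition directly.
# "source_db" and "source_tbl" are placeholders from the thread.
glue = boto3.client("glue")

table = glue.get_table(DatabaseName="source_db", Name="source_tbl")
columns = [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]

# If the new cost-allocation tag columns appear here but not in the
# DynamicFrame's printSchema() output, the catalog is up to date and
# the mismatch is on the ETL side.
print(columns)
```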

Asked 5 months ago, 561 views
1 Answer
Accepted Answer

A DynamicFrame doesn't take its schema from the catalog; it infers the schema from the actual data files.
A DataFrame does use the catalog, and since you are converting to one anyway, you can just do:

usage_df = spark.table("source_db.source_tbl")
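Put together, the catalog-backed read might look like the sketch below. Note that `spark.table` takes a single dotted `"database.table"` name, not two arguments; the table and filter names are the placeholders from the question:

```python
# Sketch: read through the Spark catalog so columns registered in the
# Glue Data Catalog are visible. Names are placeholders from the thread.
usage_df = spark.table("source_db.source_tbl")  # fully qualified "db.table"
usage_df = usage_df.filter(filter_clause)
usage_df.printSchema()  # should now include the newly added columns
```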
AWS EXPERT, answered 5 months ago
Reviewed by an AWS EXPERT 5 months ago
  • Thanks a lot. It worked. usage_df = spark.table("source_db.source_tbl")

  • @Gonzalo Herreros can you share details around fixing it and then going back to using create_dynamic_frame.from_catalog or create_data_frame.from_catalog. Or is the expectation to use only spark.table once we have updated the schema?
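
For the follow-up question about getting back to a DynamicFrame, one option is `DynamicFrame.fromDF`, which wraps an existing DataFrame so later Glue-specific steps (e.g. Glue writers) can still be used. A sketch, assuming a standard Glue job context and the placeholder names from the thread:

```python
from awsglue.dynamicframe import DynamicFrame

# Sketch: read with the catalog-backed DataFrame API, then convert
# back to a DynamicFrame for downstream Glue transforms/sinks.
# fromDF(dataframe, glue_ctx, name) is the standard conversion.
usage_df = spark.table("source_db.source_tbl")
usage_dyf = DynamicFrame.fromDF(usage_df, glueContext, "usage_dyf")
```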
