Glue Crawler error: Folder partition keys do not match table partition keys


I have a glue crawler that continuously logs the following error: "Folder partition keys do not match table partition keys, skipped folder: s3://<Bucket>/<datatype>/year=2021/month=10/"

It's the only path that doesn't get crawled among the dozen or so crawlers I run.

The structure for all files is like this: "s3://<Bucket>/<datatype>/year=2021/month=10/day=01/file-abc.json.gz"

But it seems the crawler stops at the "month" partition and doesn't even descend to the "day" partition.

What I have checked:

  • There are no files saved directly in the "month" partition (see the listing sketch after this list for one way to verify)
  • The structure is correct (month=10/day=01/file-xyz.json.gz etc)
  • All files have correct content-type and content-encoding
  • Deleting and re-uploading all files doesn't work
  • Moving files to a different path (month=oct) doesn't work
  • Deleting and creating a new crawler and classifier doesn't work
  • Re-running countless times, same result
  • It's not related to permissions (I checked IAM, Lake Formation, and SCPs)

Configuration is as follows:

    {
      "Version": 1,
      "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
      },
      "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
    }

and the SchemaChangePolicy:

    {
      "UpdateBehavior": "UPDATE_IN_DATABASE",
      "DeleteBehavior": "DEPRECATE_IN_DATABASE"
    }

Any ideas to solve this are welcome!

Thanks

  • I believe the problem is with the files in the partition.

    To verify, start by making a crawler that crawls only the problematic location "s3://<Bucket>/<datatype>/year=2021/month=10/"

    I recommend setting the maximum number of tables to 1. If the crawler then fails with "ERROR : The number of tables detected by crawler: XX is greater than the table threshold value provided: 1", you've found the cause: the files resolve to more than one schema. (A sketch of such a test crawler follows at the end of this comment.)

    To fix it, ignore the schema similarity check by ticking "Create a single schema for each S3 path". https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-detect-schema/

    Good luck!

sks_dk
Asked 2 years ago · 3,197 views
1 Answer

Hi, please check the following:

  • If the issue is caused by inconsistent partition structure, then make the structure consistent by renaming the S3 path manually or programmatically.

  • If the partition is skipped due to a mismatch in file format, compression format, or schema, and the data doesn't need to be included in the intended table, then consider the following:

  • Use an exclude pattern to skip any unwanted files (see the sketch after this list).

  • Move the unwanted file to a different location.

  • If your data has different schemas in some input files and similar schemas in other input files, then combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is turned on and the data is compatible, then the crawler ignores the similarity of specific schemas when evaluating S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.

AWS
Answered 2 years ago
  • Hi, thanks a lot for your answer.

    Unfortunately, I've already tried all of the above and the issue persists. For now I've resorted to excluding the single file and implemented a Lambda-based ETL job for this data type instead of using Glue jobs.
