Glue Crawler error: Folder partition keys do not match table partition keys


I have a glue crawler that continuously logs the following error: "Folder partition keys do not match table partition keys, skipped folder: s3://<Bucket>/<datatype>/year=2021/month=10/"

It's the only path that doesn't get crawled among the dozen or so crawlers I run.

The structure for all files is like this: "s3://<Bucket>/<datatype>/year=2021/month=10/day=01/file-abc.json.gz"

However, it seems the crawler stops at the "month" partition and never goes down to the "day" partition.

What I have checked:

  • There are no files saved directly in the "month" partition.
  • The structure is correct (month=10/day=01/file-xyz.json.gz, etc.).
  • All files have the correct content-type and content-encoding.
  • Deleting and re-uploading all files doesn't help.
  • Moving the files to a different path (month=oct) doesn't help.
  • Deleting the crawler and classifier and creating new ones doesn't help.
  • Re-running countless times gives the same result.
  • It's not related to permissions (checked IAM, Lake Formation, SCP).

Configuration is as follows:

    {
      "Version": 1,
      "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
      },
      "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
    }

and the SchemaChangePolicy:

    {
      "UpdateBehavior": "UPDATE_IN_DATABASE",
      "DeleteBehavior": "DEPRECATE_IN_DATABASE"
    }
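Since the error message compares the folder's partition keys against the catalog table's, one thing worth confirming is what partition keys the existing table actually carries. A minimal boto3 sketch, with hypothetical database and table names (replace "my_database" and "my_table" with the real ones):

    import boto3

    glue = boto3.client("glue")

    # Fetch the table definition the crawler is updating
    resp = glue.get_table(DatabaseName="my_database", Name="my_table")
    table_keys = [k["Name"] for k in resp["Table"]["PartitionKeys"]]

    # The folder layout implies ['year', 'month', 'day']; a mismatch in
    # names, order, or count is what produces the "Folder partition keys
    # do not match table partition keys" error.
    print("Table partition keys:", table_keys)

If the table was created earlier with different keys (or none at all), deleting the table and letting the crawler recreate it is one way to realign them.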

Any ideas to solve this are welcome!

Thanks

  • I believe the problem is with the files in the partition.

    To verify, start with making a crawler that only crawls the problematic location "s3://<Bucket>/<datatype>/year=2021/month=10/"

    I recommend setting the maximum number of tables to 1. If the crawler then fails with "ERROR : The number of tables detected by crawler: XX is greater than the table threshold value provided: 1", you have confirmed that the crawler is detecting multiple incompatible schemas under that path.

    To fix it, ignore the schema similarity check by ticking "Create a single schema for each S3 path". https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-detect-schema/
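    If the test crawler does hit the threshold, it can also help to list everything under the prefix and flag objects that break the expected day= layout. A rough boto3 sketch (bucket name and prefix are placeholders):

        import boto3

        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")

        bucket = "my-bucket"  # placeholder
        prefix = "datatype/year=2021/month=10/"  # placeholder

        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                # Partition folders between the prefix and the file name
                parts = obj["Key"][len(prefix):].split("/")[:-1]
                if not (len(parts) == 1 and parts[0].startswith("day=")):
                    print("Suspicious key:", obj["Key"])

    Any key this prints sits at the wrong depth or under a differently named partition folder, which can be enough to make the crawler skip the whole month=10 folder.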

    Good luck!

sks_dk
asked 2 years ago · 3149 views
1 Answer

Hi, please check the following:

  • If the issue is caused by an inconsistent partition structure, make the structure consistent by renaming the S3 paths manually or programmatically.

  • If the partition is skipped due to a mismatch in file format, compression format, or schema, and the data isn't required in the intended table, consider one of the following:

      • Use an exclude pattern to skip the unwanted files (see the sketch after this list).

      • Move the unwanted files to a different location.

  • If your data has different schemas in some input files and similar schemas in other input files, combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is turned on and the data is compatible, the crawler ignores the similarity of specific schemas when evaluating S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
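For the exclude-pattern route, the pattern is set on the crawler's S3 target, for example through the UpdateCrawler API. A minimal boto3 sketch, with a hypothetical crawler name, path, and pattern (note that the Targets argument replaces the crawler's existing target list, so include every target the crawler should keep):

    import boto3

    glue = boto3.client("glue")

    glue.update_crawler(
        Name="my-crawler",  # placeholder
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://my-bucket/datatype/",  # placeholder
                    # Glob patterns are relative to the include path
                    "Exclusions": ["year=2021/month=10/**"],
                }
            ]
        },
    )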

AWS
answered 2 years ago
  • Hi, thanks a lot for your answer.

    Unfortunately, I've already tried all of the above but the issue persists. I've resorted to excluding the single file for now and implemented a Lambda-based ETL job for this data type instead of using Glue jobs.
