Hi, please check the following:
- If the issue is caused by inconsistent partition structure, then make the structure consistent by renaming the S3 path manually or programmatically.
- If the partition is skipped due to a mismatch in file format, compression format, or schema, and the data isn't required to be included in the intended table, then consider the following:
  - Use an exclude pattern to skip any unwanted files.
  - Move the unwanted file to a different location.
- If your data has different schemas in some input files and similar schemas in other input files, then combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is turned on and the data is compatible, then the crawler ignores the similarity of specific schemas when evaluating S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
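For anyone who sets up the crawler through the API rather than the console, here is a minimal sketch of the same two settings using boto3. It assumes the `Exclusions` list on the S3 target for exclude patterns and the `CombineCompatibleSchemas` grouping policy as the API counterpart of "Create a single schema for each S3 path". The crawler name, role ARN, database name, and exclude pattern are hypothetical placeholders, not values from this thread.

```python
import json


def build_crawler_request(name, role_arn, database, include_path, exclusions):
    """Build keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                # Exclude patterns are globs evaluated relative to the include path.
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        # Assumed API counterpart of "Create a single schema for each S3 path".
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


def create_crawler(request):
    """Send the request to AWS Glue (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder can be used without boto3

    boto3.client("glue").create_crawler(**request)


if __name__ == "__main__":
    request = build_crawler_request(
        name="my-datatype-crawler",  # hypothetical
        role_arn="arn:aws:iam::123456789012:role/MyGlueRole",  # hypothetical
        database="my_database",  # hypothetical
        include_path="s3://<Bucket>/<datatype>/",
        exclusions=["year=2021/month=10/unwanted-file*"],  # hypothetical pattern
    )
    print(json.dumps(request, indent=2))
```

The request is built separately from the call so you can inspect it before anything is created in your account.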
Hi, thanks a lot for your answer.
Unfortunately, I've already tried all of the above, but the issue persists. For now I've resorted to excluding the single file, and I've implemented a Lambda-based ETL job for this data type instead of using Glue jobs.
I believe the problem is with the files in the partition.
To verify, start by creating a crawler that crawls only the problematic location "s3://<Bucket>/<datatype>/year=2021/month=10/".
I recommend setting the maximum number of tables to 1. If you do this and the crawler fails with "ERROR : The number of tables detected by crawler: XX is greater than the table threshold value provided: 1", you know the cause: the crawler is detecting more than one schema in that partition.
To fix it, make the crawler skip the schema similarity check by ticking "Create a single schema for each S3 path". https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-detect-schema/
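If you prefer to script the check, a rough sketch with boto3 could look like the following. It assumes the table limit is set through the `CrawlerOutput.Tables.TableThreshold` key of the crawler configuration, and that the failure text appears in `LastCrawl.ErrorMessage` returned by `get_crawler`. The helper names, crawler name, role ARN, and database name are made up for illustration.

```python
import json

# Fragment of the error message quoted above; used to recognise the failure.
THRESHOLD_ERROR_FRAGMENT = "is greater than the table threshold value provided"


def diagnostic_crawler_request(name, role_arn, database, partition_path, max_tables=1):
    """Build create_crawler arguments for a crawler limited to one partition."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": partition_path}]},
        # Assumed configuration key for the maximum number of tables.
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "CrawlerOutput": {"Tables": {"TableThreshold": max_tables}},
            }
        ),
    }


def exceeded_table_threshold(error_message):
    """Return True if the crawl failed because too many tables were detected."""
    return bool(error_message) and THRESHOLD_ERROR_FRAGMENT in error_message


def last_crawl_error(crawler_name):
    """Fetch the last crawl's error message (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    crawler = boto3.client("glue").get_crawler(Name=crawler_name)["Crawler"]
    return crawler.get("LastCrawl", {}).get("ErrorMessage", "")


if __name__ == "__main__":
    request = diagnostic_crawler_request(
        name="partition-check",  # hypothetical
        role_arn="arn:aws:iam::123456789012:role/MyGlueRole",  # hypothetical
        database="scratch_db",  # hypothetical
        partition_path="s3://<Bucket>/<datatype>/year=2021/month=10/",
    )
    print(json.dumps(request, indent=2))
```

After the crawler has run once, passing the result of `last_crawl_error("partition-check")` to `exceeded_table_threshold` tells you whether the partition produced more than one table.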
gl