Why doesn't AWS Glue add partitions to a table during an incremental crawl?

I want to troubleshoot partitions that are missing after I run an incremental AWS Glue crawl.

Short description

When an AWS Glue crawler runs an incremental crawl, it identifies only the partitions that were added after the previous crawl. For the crawler to add a new partition to the table, more than 70% of the files in that partition must have the same schema as the table.
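The 70% rule can be illustrated with a small calculation. The following Python snippet is only a sketch of the arithmetic, not the crawler's actual implementation. The function name and the representation of a schema as a tuple of (column, type) pairs are assumptions made for illustration.

```python
def meets_schema_threshold(file_schemas, table_schema, threshold_percent=70):
    """Return True if strictly more than threshold_percent of the files
    have a schema equal to table_schema.

    Schemas are modeled as tuples of (column_name, type) pairs. This is a
    hypothetical representation, not how AWS Glue stores schemas.
    """
    if not file_schemas:
        return False
    matching = sum(1 for schema in file_schemas if schema == table_schema)
    # Integer arithmetic avoids floating-point edge cases at exactly 70%.
    return matching * 100 > len(file_schemas) * threshold_percent


table = (("id", "bigint"), ("amount", "double"))
drifted = (("id", "string"), ("amount", "double"))

# 8 of 10 files match (80%), so this partition would qualify.
print(meets_schema_threshold([table] * 8 + [drifted] * 2, table))
# 7 of 10 files match (exactly 70%), which is not "more than 70%".
print(meets_schema_threshold([table] * 7 + [drifted] * 3, table))
```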

Important: An AWS Glue crawler can't add a partition that was previously flagged as a schema mismatch. It's a best practice to make sure that all of the new partition's properties match the original table's properties before your crawler runs.

Resolution

Open the Amazon CloudWatch log that corresponds with your crawler's last crawl, and then search for the new partition's Amazon Simple Storage Service (Amazon S3) prefix. If the new partition's schema and the original table's schema don't match, then a "Partition does not match table schema or has mismatch keys" message appears.
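If you prefer to search the log programmatically, the following Python sketch shows one possible approach with boto3. It assumes that the crawler writes to the /aws-glue/crawlers log group in a log stream named after the crawler. Verify both in your account. The function names are hypothetical helpers, not part of any AWS SDK.

```python
MISMATCH_TEXT = "Partition does not match table schema or has mismatch keys"


def fetch_crawler_messages(crawler_name, log_group="/aws-glue/crawlers"):
    """Download all log messages for a crawler from CloudWatch Logs.

    Assumes the log stream is named after the crawler.
    """
    import boto3  # imported here so the helpers below work without boto3

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    messages = []
    for page in paginator.paginate(
        logGroupName=log_group, logStreamNames=[crawler_name]
    ):
        messages.extend(event["message"] for event in page["events"])
    return messages


def messages_for_prefix(messages, partition_prefix):
    """Return the log messages that mention the partition's S3 prefix."""
    return [m for m in messages if partition_prefix in m]


def reports_schema_mismatch(messages):
    """Return True if any message contains the schema mismatch text."""
    return any(MISMATCH_TEXT in m for m in messages)
```

For example, you might call fetch_crawler_messages("my-crawler"), where my-crawler is a placeholder name, and then pass the result to messages_for_prefix and reports_schema_mismatch to decide whether the partition needs attention.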

If you receive the preceding error message, then verify that the following properties in the new partition and the original table match:

  • Compression format
  • File type
  • File schema

Make sure that the new partition's S3 structure matches the original table's S3 structure. For example, if the original table's S3 structure uses the yyyy-mm-dd date format, then the new partition's S3 structure must also use the yyyy-mm-dd date format. If the properties don't match, then modify the files in the new partition to match the original table.
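The date format check from the preceding example can be sketched in Python. This snippet assumes that the partition value is the last segment of the S3 prefix, either bare (2024-01-05) or in Hive style (dt=2024-01-05). The bucket name, the dt key, and the function names are illustrative only.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def partition_value(prefix):
    """Return the value of the last path segment, dropping any 'key=' part."""
    segment = prefix.rstrip("/").rsplit("/", 1)[-1]
    return segment.split("=", 1)[-1]


def uses_yyyy_mm_dd(prefix):
    """Return True if the last segment of the prefix is a valid yyyy-mm-dd date."""
    value = partition_value(prefix)
    # The regex enforces zero-padded fields; strptime rejects impossible dates.
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(uses_yyyy_mm_dd("s3://my-bucket/sales/dt=2024-01-05/"))  # True
print(uses_yyyy_mm_dd("s3://my-bucket/sales/20240105/"))       # False
```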

Then, use Amazon Athena to add the new partition to the table. For Hive-style partitions, run the MSCK REPAIR TABLE command. For non-Hive-style partitions, run the ALTER TABLE ADD PARTITION command.
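As a rough sketch, the two Athena statements can be built and submitted from Python with boto3. The table name, partition key, S3 locations, and function names below are placeholders, not values from your environment. The IF NOT EXISTS clause is optional and is included here only so that the statement can be rerun safely.

```python
def build_partition_statement(table, hive_style, partition_spec=None, location=None):
    """Build the Athena statement that registers a partition.

    hive_style=True  -> MSCK REPAIR TABLE (Athena discovers key=value folders)
    hive_style=False -> ALTER TABLE ADD PARTITION with an explicit location
    """
    if hive_style:
        return f"MSCK REPAIR TABLE {table}"
    if not partition_spec or not location:
        raise ValueError("non-Hive-style partitions need partition_spec and location")
    spec = ", ".join(f"{key} = '{value}'" for key, value in partition_spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def run_in_athena(sql, database, output_location):
    """Submit a statement to Athena and return the query execution ID."""
    import boto3  # imported here so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Placeholder values for illustration only.
print(build_partition_statement("sales", hive_style=True))
print(
    build_partition_statement(
        "sales",
        hive_style=False,
        partition_spec={"dt": "2024-01-05"},
        location="s3://my-bucket/sales/2024-01-05/",
    )
)
```

start_query_execution only submits the statement. To confirm that the partition was added, check the query status in Athena or look for the partition in the table afterward.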

AWS OFFICIAL
Updated 20 days ago