Glue crawler to exclude all files except the ones that match a pattern


I have an include path like this one: s3://my-datalake/projects/. This projects folder contains the following folders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a parquet file. When the crawler runs, I want it to exclude everything whose name starts with incremental_.

I tried the exclude pattern incremental_**/**. It works for one crawler but not for another: when I run the second crawler, it either does not update the table or fails.

Asked 1 year ago, viewed 820 times
1 answer

I tested a crawler against the same S3 folder structure you described.

Specified include path as: s3://my-datalake/projects/

Exclude pattern as: incremental_**/**
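To see why this pattern catches everything under the incremental_ folders, here is a rough sketch of how such a glob behaves. It is a simplified approximation, not Glue's actual matcher: it assumes `**` crosses folder boundaries, `*` and `?` do not, and the pattern is matched against the key relative to the include path. The function names and the part-0000.parquet file name are made up for illustration.

```python
import re

def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex (approximation only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # '*' stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # '?' is one character, not '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def is_excluded(key: str, include_prefix: str, pattern: str) -> bool:
    """Match the pattern against the key relative to the include path."""
    relative = key[len(include_prefix):] if key.startswith(include_prefix) else key
    return glob_to_regex(pattern).fullmatch(relative) is not None

# Hypothetical object keys under s3://my-datalake/
keys = [
    "projects/daily-2022-11-05/part-0000.parquet",
    "projects/daily-2022-11-06/part-0000.parquet",
    "projects/incremental_123456/part-0000.parquet",
    "projects/incremental_234567/part-0000.parquet",
]
kept = [k for k in keys if not is_excluded(k, "projects/", "incremental_**/**")]
```

Under these assumptions, only the two daily-* keys remain in `kept`.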

With this exclude pattern, the crawler ignores all files under folders whose names start with 'incremental_'. One additional thing to check: the existing crawlers may have "UpdateBehavior" set to "LOG", in which case tables that were already created are not dropped. You could try changing it to "UPDATE_IN_DATABASE", which will recreate the tables.
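If you manage the crawler from code, the same two settings could be applied roughly like this. This is only a sketch: the crawler name my-projects-crawler and the helper function are made up, and the dictionary is meant to be passed to boto3's `update_crawler`, so check the other fields of your crawler before applying it.

```python
def build_crawler_update(crawler_name: str, include_path: str, exclusions: list) -> dict:
    """Build keyword arguments for the Glue UpdateCrawler call (hypothetical helper)."""
    return {
        "Name": crawler_name,
        "Targets": {
            "S3Targets": [
                # Exclusions are glob patterns relative to the include path
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        # The setting suggested above, changed from "LOG"
        "SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE"},
    }

update_args = build_crawler_update(
    "my-projects-crawler",            # assumed name, replace with your crawler
    "s3://my-datalake/projects/",
    ["incremental_**/**"],
)

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").update_crawler(**update_args)
```

The AWS call is left commented out so the snippet can be inspected without touching a real crawler.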

Reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude

AWS
answered 1 year ago
