Glue crawler to exclude all files except the ones that match a pattern


I have an include path like this one: s3://my-datalake/projects/. This projects folder contains these folders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a parquet file. When the crawler runs, I want it to exclude everything whose name starts with incremental_.

I tried using incremental_**/** as the exclude pattern. This works for one crawler but not for another. By "not working" I mean that when I run the second crawler, it either doesn't update the table or fails.

asked a year ago, 895 views
1 Answer

I've tested a crawler using the same folder structure in S3 as mentioned.

Specified include path as: s3://my-datalake/projects/

Exclude pattern as: incremental_**/**
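To sanity-check which object keys a pattern like this would cover, here is a minimal Python sketch. It is only an approximation: it assumes that `**` may cross folder boundaries, that `*` and `?` stay within one path segment, and that the pattern is evaluated relative to the include path. The helper names `glob_to_regex` and `is_excluded`, and the file name `data.parquet`, are hypothetical and not part of any AWS API.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset (`**`, `*`, `?`) into a compiled regex.

    Assumption: this only approximates the crawler's glob handling.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(key: str, include_prefix: str, pattern: str) -> bool:
    """Return True if `key`, taken relative to `include_prefix`, matches `pattern`."""
    relative = key[len(include_prefix):] if key.startswith(include_prefix) else key
    return glob_to_regex(pattern).fullmatch(relative) is not None


if __name__ == "__main__":
    # Hypothetical keys under s3://my-datalake/ mirroring the folders in the question.
    keys = [
        "projects/daily-2022-11-05/data.parquet",
        "projects/daily-2022-11-06/data.parquet",
        "projects/incremental_123456/data.parquet",
        "projects/incremental_234567/data.parquet",
    ]
    for key in keys:
        print(key, is_excluded(key, "projects/", "incremental_**/**"))
```

Under these assumptions, only the keys in the two incremental_ folders are reported as excluded, while the daily- folders remain in scope.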

With this exclude pattern, the crawler ignores all files under folders whose names start with 'incremental_'. One additional thing to check: the existing crawlers may have "UpdateBehavior" set to "LOG", in which case the tables that were already created are left unchanged in the Data Catalog. You could try changing it to "UPDATE_IN_DATABASE" so that the crawler applies its changes to those tables.
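For reference, the two settings discussed here (the exclude pattern and "UpdateBehavior") can also be set programmatically. The following is a minimal sketch using boto3's Glue `update_crawler` call. The crawler name `my-crawler` and the helper names are hypothetical, and the sketch assumes the crawler has a single S3 target; `Targets` is sent as a whole, so any other targets on a real crawler would need to be included as well.

```python
def build_crawler_update(name, include_path, exclusions,
                         update_behavior="UPDATE_IN_DATABASE"):
    """Build keyword arguments for the Glue UpdateCrawler API.

    Assumption: the crawler has exactly one S3 target.
    """
    return {
        "Name": name,
        "Targets": {
            "S3Targets": [
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        "SchemaChangePolicy": {"UpdateBehavior": update_behavior},
    }


def apply_crawler_update(params):
    """Send the update to AWS Glue (requires boto3 and valid credentials)."""
    import boto3  # imported here so building the parameters needs no AWS setup

    glue = boto3.client("glue")
    glue.update_crawler(**params)


if __name__ == "__main__":
    params = build_crawler_update(
        name="my-crawler",  # hypothetical crawler name
        include_path="s3://my-datalake/projects/",
        exclusions=["incremental_**/**"],
    )
    print(params)
    # apply_crawler_update(params)  # uncomment to apply against a real crawler
```

Keeping the parameter construction separate from the API call makes it easy to inspect what would be sent before changing a working crawler, and to compare the configuration of the crawler that works against the one that does not.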

Reference -

AWS
answered a year ago
