Crawler not excluding files even though CloudWatch shows they match the exclusion pattern

Hi,

New to AWS! I'm running a crawler over a folder in an S3 bucket containing roughly 1000 files, a mix of xlsx and csv. I want the crawler to load only the csv files into a table that I will transform using Athena. New files of both types will continue to be added to the folder, so I'm not looking for a one-time fix that only covers the files currently there.

The structure is similar to below:

s3bucket/basefolder/
    aaa_Report.xlsx
    aab_Report.xlsx
    aac_Report.xlsx
    ...
    aaa.csv
    aab.csv
    aac.csv
    ...

I have tried a number of different exclusion patterns, including *.xlsx, **.xlsx, * Report * and the like, and the xlsx files still appear in the table. I eventually noticed that the CloudWatch logs show the xlsx files being identified as matching *.xlsx and claim they are being excluded, yet they are clearly still present in the final table.
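For anyone wanting to sanity-check patterns like these locally, here is a minimal sketch in Python. It is not Glue's own matcher, only an approximation of the documented glob rules (a single `*` stays within one folder level, `**` crosses folder boundaries), applied to the object key relative to the include path. The function names `glob_to_regex` and `is_excluded` are made up for illustration.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset into a regex (approximation only).

    '**' matches across '/', '*' matches within one path component,
    '?' matches a single character other than '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(include_path: str, object_key: str, patterns: list[str]) -> bool:
    """Check a key against exclude patterns, relative to the include path."""
    if object_key.startswith(include_path):
        relative = object_key[len(include_path):]
    else:
        relative = object_key
    return any(glob_to_regex(p).fullmatch(relative) for p in patterns)


# Keys modelled on the folder layout in the question.
xlsx_excluded = is_excluded("basefolder/", "basefolder/aaa_Report.xlsx", ["*.xlsx"])
csv_excluded = is_excluded("basefolder/", "basefolder/aaa.csv", ["*.xlsx"])
```

Under these assumed rules the xlsx keys do match `*.xlsx` while the csv keys do not, which agrees with what the CloudWatch logs report.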

Does anyone have any advice on what I may be doing wrong, how to fix it or any alternative methods to achieve my desired result of one table only containing csv files from this constantly updated bucket?

Cheers

asked 8 months ago, 201 views
1 Answer
Hi,

Maybe someone will know better than me, but it might be worth checking that your include path is configured correctly and points to the right bucket, if you haven't done so already.

From the AWS Glue documentation:

*When evaluating what to include or exclude in a crawl, a crawler starts by evaluating the required include path. For Amazon S3, MongoDB, MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility), and relational data stores, you must specify an include path.*

*For Amazon S3 data stores, include path syntax is bucket-name/folder-name/file-name.ext. To crawl all objects in a bucket, you specify just the bucket name in the include path. The exclude pattern is relative to the include path.*
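To make the include path and exclude pattern relationship concrete, here is a rough sketch of how the crawler target could be expressed for the Glue API in Python. The bucket and folder names are taken from the question; the helper `build_s3_target` and the crawler name `my-crawler` are hypothetical, and the boto3 call is left as a comment because it needs credentials and an existing crawler.

```python
def build_s3_target(bucket: str, folder: str, exclusions: list[str]) -> dict:
    """Build the Targets structure for a Glue crawler with one S3 target.

    The exclusion patterns are interpreted relative to the include path,
    so "**.xlsx" here means "any .xlsx object under s3://bucket/folder/".
    """
    path = f"s3://{bucket}/{folder.strip('/')}/"
    return {"S3Targets": [{"Path": path, "Exclusions": list(exclusions)}]}


targets = build_s3_target("s3bucket", "basefolder", ["**.xlsx"])

# With boto3 installed and credentials configured, this could be applied
# to an existing crawler (the name below is a placeholder):
#   import boto3
#   boto3.client("glue").update_crawler(Name="my-crawler", Targets=targets)
```

Printing `targets` before applying it is a quick way to confirm that the pattern is written relative to `basefolder/` rather than to the bucket root.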

answered 8 months ago
  • Hi James, thanks for the response!

    I'm fairly certain my path is configured correctly, or else none of my files would make it to the table. Also, the crawler is able to see the files, or else CloudWatch wouldn't claim that it is excluding them.
