Glue: Using S3 ObjectCreated events with Crawler Catalog Target


I'm attempting to create a crawler that targets a pre-existing Data Catalog table whose data is stored in S3. I would like to use the "CRAWL_EVENT_MODE" recrawl policy, but this appears to be available only for S3 targets in the crawler, not for Data Catalog tables backed by S3 storage.
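To make the two setups concrete, here is a minimal sketch of the request shapes involved, written as plain dictionaries in the form accepted by boto3's Glue `create_crawler` call. All names, ARNs, and paths are hypothetical placeholders, and the helper functions are mine, not part of any AWS library. The first builder shows the S3-target form where "CRAWL_EVENT_MODE" can be set; the second shows the catalog-target form I actually want, which, as described above, I can only get to work with a full recrawl.

```python
from typing import Any, Dict, List


def s3_event_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str, queue_arn: str
) -> Dict[str, Any]:
    """Crawler bound directly to an S3 path, driven by S3 events via SQS."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # EventQueueArn points at the SQS queue receiving ObjectCreated events.
            "S3Targets": [{"Path": s3_path, "EventQueueArn": queue_arn}]
        },
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    }


def catalog_crawler_request(
    name: str, role_arn: str, database: str, tables: List[str]
) -> Dict[str, Any]:
    """Crawler bound to existing catalog tables; full recrawl on every run."""
    return {
        "Name": name,
        "Role": role_arn,
        "Targets": {
            "CatalogTargets": [{"DatabaseName": database, "Tables": tables}]
        },
        # The mode reported as working for catalog targets in this question.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVERYTHING"},
    }


# Hypothetical placeholder values, for illustration only.
s3_request = s3_event_crawler_request(
    "example-s3-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db",
    "s3://example-bucket/example-prefix/",
    "arn:aws:sqs:us-east-1:123456789012:example-queue",
)
catalog_request = catalog_crawler_request(
    "example-catalog-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db",
    ["example_table"],
)
```

Either dictionary could then be passed as `boto3.client("glue").create_crawler(**request)`; the sketch stops short of that so it runs without AWS credentials.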

Is there a way around this?

I need to have the table defined in the Data Catalog first, because the source objects have no self-describing schema, and the crawler produces an incorrect schema when I allow it to create the table from scratch.

I would also like to use S3 events to optimize the crawler's behaviour and achieve near-real-time latency.

Thanks

  • In fact, incremental crawls are not supported either when the target is a Data Catalog table. This seems rather inefficient, and it seems to force crawlers to be bound directly to S3 data sources instead of going through clear, well-defined tables in the catalog.

No Answers
