Glue: Using S3 ObjectCreated events with Crawler Catalog Target

I'm attempting to create a crawler that targets a pre-existing table in the Data Catalog, which describes data stored in S3. I would like to use the "CRAWL_EVENT_MODE" recrawl policy, but this appears to be available only for S3 targets in the crawler, not for Data Catalog tables backed by S3 storage.

Is there a way around this?

I need to have the table defined in the Data Catalog first, because the source objects have no self-describing schema, and the crawler produces an incorrect schema when I allow it to create the table from scratch.

I would also like to use S3 events to optimize the crawler behaviour and achieve near real time latency.

Thanks

  • In fact, incremental crawls are not supported either when the target is a Data Catalog table. This seems rather inefficient, and it seems to force crawlers to bind directly to S3 data sources instead of going through clear, well-defined tables in the catalog.

1 Answer

Hello. After reviewing the AWS Glue documentation, one solution is to configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (SQS) queue, which the crawler uses to identify newly added or deleted objects. The crawler reads these events from the queue rather than inspecting every object in the target.
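As a rough illustration of the S3-to-SQS notification described above, the following Python sketch builds the notification configuration and applies it with boto3's `put_bucket_notification_configuration`. The helper names, the queue ARN, the bucket name, and the prefix are hypothetical placeholders, not values from the question, and the sketch assumes the queue's access policy already allows S3 to send messages to it.

```python
def build_queue_notification(queue_arn, prefix=""):
    """Build an S3 NotificationConfiguration that routes object events to SQS.

    queue_arn and prefix are caller-supplied; nothing here is taken from
    the original question.
    """
    queue_config = {
        "QueueArn": queue_arn,
        # Created and removed objects are both forwarded to the queue.
        "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
    }
    if prefix:
        # Limit notifications to keys under the table's S3 prefix.
        queue_config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return {"QueueConfigurations": [queue_config]}


def apply_notification(bucket, notification_config):
    """Apply the configuration to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=notification_config,
    )


if __name__ == "__main__":
    # Hypothetical ARN and prefix, for illustration only.
    config = build_queue_notification(
        "arn:aws:sqs:us-east-1:111122223333:example-crawler-queue",
        prefix="data/my_table/",
    )
    print(config)
    # apply_notification("example-bucket", config)  # uncomment to apply
```

Note that `put_bucket_notification_configuration` replaces the bucket's whole notification configuration, so any existing notifications would need to be merged into the dictionary first.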

To achieve near real-time latency with S3 events, you can set up the crawler with either an Amazon S3 target or a Data Catalog target. Once the first crawl has completed successfully, the crawler consumes the Amazon S3 events for subsequent crawls instead of listing the whole target. The crawl operations are recorded in the crawler's log, which shows its current activity. This should reduce the latency to near real-time.
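A minimal sketch of the Data Catalog variant, assuming the Glue API in your region accepts `EventQueueArn` on a `CatalogTargets` entry together with the `CRAWL_EVENT_MODE` recrawl behavior. The helper names and every argument value are hypothetical. The schema change policy shown is also an assumption: `LOG` is chosen so the hand-defined schema is left alone, and catalog targets are generally documented as requiring the `LOG` delete behavior.

```python
def build_event_mode_crawler(name, role_arn, database, tables,
                             queue_arn, dlq_arn=None):
    """Build create_crawler arguments for a catalog-target crawler in event mode."""
    catalog_target = {
        "DatabaseName": database,
        "Tables": list(tables),
        # Queue that receives the S3 event notifications for the table's data.
        "EventQueueArn": queue_arn,
    }
    if dlq_arn:
        # Optional dead-letter queue for events that could not be processed.
        catalog_target["DlqEventQueueArn"] = dlq_arn
    return {
        "Name": name,
        "Role": role_arn,
        "Targets": {"CatalogTargets": [catalog_target]},
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        # Assumption: log changes instead of rewriting the existing schema.
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
    }


def create_crawler(params):
    """Create the crawler (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("glue").create_crawler(**params)


if __name__ == "__main__":
    # Hypothetical names and ARNs, for illustration only.
    params = build_event_mode_crawler(
        name="example-event-crawler",
        role_arn="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
        database="example_db",
        tables=["my_table"],
        queue_arn="arn:aws:sqs:us-east-1:111122223333:example-crawler-queue",
    )
    print(params)
    # create_crawler(params)  # uncomment to create the crawler
```

The crawler still has to be started on a schedule or on demand, so the achievable latency depends on how often it runs.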

I hope one of these solutions helps!

Some helpful information:

  • https://aws.amazon.com/blogs/big-data/run-aws-glue-crawlers-using-amazon-s3-event-notifications/#:~:text=You%20can%20configure%20Amazon%20S3,are%20found%2C%20the%20crawler%20stops.
  • https://aws.amazon.com/about-aws/whats-new/2021/10/aws-glue-crawlers-amazon-s3-notifications/

AWS
answered 8 months ago
