Hello, after reviewing the AWS Glue documentation, one solution you may follow is to configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (SQS) queue. The crawler consumes messages from this queue to identify which objects were newly added or deleted, instead of listing the entire data store on every run.
Regarding near-real-time latency with S3 events, the crawler target can be either an Amazon S3 location or a Data Catalog table. Once the first full crawl has completed successfully, subsequent crawls consume the accumulated S3 events and visit only the changed objects; the crawl activity is recorded in the crawler's log so you can see what it processed. Because each incremental crawl touches only new or deleted objects, this reduces latency to near real time.
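As a rough sketch of how the pieces fit together (all bucket, queue, role, and database names below are hypothetical placeholders, not taken from the question): the bucket's notification configuration routes object events to the SQS queue, and the crawler is created with that queue's ARN and a `CRAWL_EVENT_MODE` recrawl policy. The functions only build the request payloads; the commented-out boto3 calls show where they would be applied.

```python
# Sketch of wiring an AWS Glue crawler to Amazon S3 event notifications.
# All names and ARNs are hypothetical examples.

def build_s3_notification_config(queue_arn: str) -> dict:
    """S3 bucket notification that routes object-level events to an SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                # The crawler needs both create and delete events so it can
                # track added as well as removed objects.
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }


def build_crawler_request(name: str, role_arn: str, database: str,
                          s3_path: str, queue_arn: str) -> dict:
    """Keyword arguments for glue.create_crawler in S3 event mode."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "EventQueueArn": queue_arn}]
        },
        # CRAWL_EVENT_MODE tells the crawler to consume the SQS queue and
        # visit only changed objects after the first full crawl.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    }


if __name__ == "__main__":
    queue_arn = "arn:aws:sqs:us-east-1:123456789012:glue-crawler-events"
    notif = build_s3_notification_config(queue_arn)
    crawler = build_crawler_request(
        "event-mode-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "my_database",
        "s3://my-bucket/data/",
        queue_arn,
    )
    # With boto3 these payloads would be applied as, for example:
    #   boto3.client("s3").put_bucket_notification_configuration(
    #       Bucket="my-bucket", NotificationConfiguration=notif)
    #   boto3.client("glue").create_crawler(**crawler)
    print(crawler["RecrawlPolicy"]["RecrawlBehavior"])
```

Building the payloads separately from the API calls keeps the sketch runnable without AWS credentials and makes the event-mode settings easy to review.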
I hope one of these solutions helps!
Some helpful information:
- https://aws.amazon.com/blogs/big-data/run-aws-glue-crawlers-using-amazon-s3-event-notifications/#:~:text=You%20can%20configure%20Amazon%20S3,are%20found%2C%20the%20crawler%20stops.
- https://aws.amazon.com/about-aws/whats-new/2021/10/aws-glue-crawlers-amazon-s3-notifications/
In fact, incremental crawling is not supported when the target is a Data Catalog table either. This seems rather inefficient: it forces crawlers to be bound directly to S3 data sources instead of working through clear, well-defined tables in the catalog.