The methods mentioned in the other answers are correct. It is also possible, though, to use an event-driven workflow in Glue that is triggered by S3 events delivered through EventBridge; you can read the details in this blog post.
The trigger has a batch size with which you can specify after how many events the job should start. You can also find additional detail in this section of the documentation.
Triggers within workflows can start both jobs and crawlers and can be fired when jobs or crawlers complete. ... There are three types of start triggers:

* Schedule – The workflow is started according to a schedule that you define. The schedule can be daily, weekly, monthly, and so on, or can be a custom schedule based on a cron expression.
* On demand – The workflow is started manually from the AWS Glue console, API, or AWS CLI.
* EventBridge event – The workflow is started upon the occurrence of a single Amazon EventBridge event or a batch of Amazon EventBridge events. With this trigger type, AWS Glue can be an event consumer in an event-driven architecture. Any EventBridge event type can start a workflow. A common use case is the arrival of a new object in an Amazon S3 bucket (the S3 PutObject operation).
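As a sketch, such an EventBridge-type trigger with a batch size can be created through the Glue API. The names `s3-batch-trigger`, `my-workflow`, and `my-etl-job` are placeholders, and the workflow must already exist:

```python
# Parameters for glue.create_trigger: an EVENT trigger attached to a workflow.
# BatchSize/BatchWindow control when the batched events actually start it.
trigger_params = {
    "Name": "s3-batch-trigger",        # placeholder trigger name
    "WorkflowName": "my-workflow",     # existing Glue workflow (placeholder)
    "Type": "EVENT",                   # fired by EventBridge events
    "EventBatchingCondition": {
        "BatchSize": 20,    # start the workflow after 20 matching events...
        "BatchWindow": 900, # ...or after 900 seconds, whichever comes first
    },
    "Actions": [{"JobName": "my-etl-job"}],  # job to run (placeholder)
}

# With AWS credentials configured, this registers the trigger:
# import boto3
# boto3.client("glue").create_trigger(**trigger_params)
```

Note that `BatchWindow` gives you a time-based fallback: the workflow starts when the window elapses even if fewer than `BatchSize` events have arrived.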
Hope this helps.
Hi. Without knowing all the details, it's difficult to come up with the most optimal solution. The simplest option would be: as soon as a file is uploaded to S3, use an S3 event and a Lambda function to update a counter in DynamoDB, and as soon as the counter reaches 20, trigger the ETL job. Alternatively, you can build a Step Functions workflow to implement similar logic. As an example, here's an AWS blog post which may be of help: https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
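A minimal sketch of the DynamoDB counter idea, assuming a table named `file-counter`, a job named `my-etl-job`, and a threshold of 20 (all placeholder names). DynamoDB's `ADD` update is atomic, so concurrent uploads can't double-count:

```python
EXPECTED_FILES = 20

def should_start_job(new_count: int, expected: int = EXPECTED_FILES) -> bool:
    """Start the ETL job exactly once: only when the counter hits the threshold."""
    return new_count == expected

def handler(event, context):
    import boto3  # available in the Lambda runtime
    ddb = boto3.client("dynamodb")
    # Atomically increment the counter for this batch.
    resp = ddb.update_item(
        TableName="file-counter",
        Key={"pk": {"S": "daily-batch"}},
        UpdateExpression="ADD filecount :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    count = int(resp["Attributes"]["filecount"]["N"])
    if should_start_job(count):
        boto3.client("glue").start_job_run(JobName="my-etl-job")
    return {"count": count}
```

Using `== expected` rather than `>= expected` means late stragglers beyond the 20th file won't start a second job run; you'd also want to reset the counter per batch (e.g. key the item by date).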
Or, if you know the order of the files, you can have an S3 event trigger a Lambda function on that last file and process all the files. And if you don't need to process the files as soon as they land, you can trigger the ETL job at a time when you know the files will have landed in S3.
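For the "trigger at a known time" option, a time-based Glue trigger is enough. A sketch, where `daily-etl-trigger`, `my-etl-job`, and the 06:00 UTC time are placeholder assumptions:

```python
# Parameters for glue.create_trigger: a SCHEDULED trigger using Glue's
# cron syntax (minutes hours day-of-month month day-of-week year).
schedule_params = {
    "Name": "daily-etl-trigger",       # placeholder trigger name
    "Type": "SCHEDULED",
    "Schedule": "cron(0 6 * * ? *)",   # every day at 06:00 UTC
    "StartOnCreation": True,           # activate immediately
    "Actions": [{"JobName": "my-etl-job"}],  # job to run (placeholder)
}

# With AWS credentials configured:
# import boto3
# boto3.client("glue").create_trigger(**schedule_params)
```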
As part of the Lambda function, you can retrieve the object list and count it; if the count is 20, trigger the Glue job. Or, if the file names follow a naming convention and you can identify from the name itself that the last file has been received, trigger the Glue job then.
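A sketch of both variants in one Lambda handler. The bucket `my-bucket`, prefix `incoming/`, suffix `_final.csv`, and job name `my-etl-job` are assumptions for illustration:

```python
def batch_complete(keys, expected=20):
    """Variant 1: the batch is complete when the key count reaches the total."""
    return len(keys) >= expected

def is_last_file(key: str) -> bool:
    """Variant 2: detect the final file by a naming convention (assumed suffix)."""
    return key.endswith("_final.csv")

def handler(event, context):
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    # List today's objects under the expected prefix (fine for small batches;
    # use a paginator if more than 1000 keys are possible).
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    if batch_complete(keys):
        boto3.client("glue").start_job_run(JobName="my-etl-job")
```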
I have a question on the above scenario/use case. Let's say I have scheduled the Glue job to run after every 10 S3 events, but for some reason my customers uploaded only 8 files on a particular day. My job will not run that day because the counter never reached 10. In my case I can also get only one file, or more than 10 files, on a given day. Any idea how to handle this scenario?