Trigger Glue job from S3


Hi team,

I have an AWS Glue job that reads 20 CSV files from S3 and inserts them into a MySQL RDS database.

I want to trigger the Glue job only after all 20 files are in S3 (they won't arrive at exactly the same time).

How can I configure the Glue job/event rule to start only once all 20 files are present in S3?

Thank you

4 Answers
Accepted Answer

The methods mentioned in the other answers are correct. It is also possible to use an event-driven workflow in Glue that is triggered by S3 events via EventBridge; you can read the details in this blog post.

The trigger has a batch size, with which you can specify after how many events it should start the job.
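For example, such a trigger could be created with boto3 along the lines of the sketch below; the workflow, trigger, and job names are placeholders, and an EVENT trigger must belong to a Glue workflow:

```python
import boto3

glue = boto3.client("glue")

# Create an EventBridge-driven trigger inside an existing Glue workflow.
# All names here are placeholders, not values from the question.
glue.create_trigger(
    Name="start-after-20-files",
    WorkflowName="csv-to-rds-workflow",
    Type="EVENT",
    # Fire after 20 events have arrived, or once the 900-second window
    # (measured from the first event) elapses, whichever comes first.
    EventBatchingCondition={
        "BatchSize": 20,
        "BatchWindow": 900,
    },
    Actions=[{"JobName": "csv-to-mysql-job"}],
)
```

Note that with `BatchWindow` set, the trigger also fires when the window expires even if fewer than 20 events have arrived, which acts as a fallback for incomplete batches.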

You can also find additional detail in this section of the documentation:

Triggers within workflows can start both jobs and crawlers and can be fired when jobs or crawlers complete.  
...  
There are three types of start triggers:

* Schedule – The workflow is started according to a schedule that you define. The schedule can be daily, weekly, monthly, and so on, or can be a custom schedule based on a cron expression.

* On demand – The workflow is started manually from the AWS Glue console, API, or AWS CLI.

* EventBridge event – The workflow is started upon the occurrence of a single Amazon EventBridge event or a batch of Amazon EventBridge events. With this trigger type, AWS Glue can be an event consumer in an event-driven architecture. Any EventBridge event type can start a workflow. A common use case is the arrival of a new object in an Amazon S3 bucket (the S3 PutObject operation).
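To wire the S3 side into that last trigger type, the bucket must have EventBridge notifications enabled and an EventBridge rule must target the workflow. A rough sketch, where the bucket, rule, account, and role names are all assumptions, and the role needs permission to call glue:NotifyEvent:

```python
import json

import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

# S3 only delivers events to EventBridge once this is enabled on the bucket.
s3.put_bucket_notification_configuration(
    Bucket="my-csv-bucket",  # placeholder bucket name
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# Route "Object Created" events for that bucket to the Glue workflow.
events.put_rule(
    Name="csv-upload-to-glue",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-csv-bucket"]}},
    }),
)
events.put_targets(
    Rule="csv-upload-to-glue",
    Targets=[{
        "Id": "glue-workflow",
        # The Glue workflow is targeted directly by its ARN;
        # account ID, region, and names are placeholders.
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/csv-to-rds-workflow",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-glue-role",
    }],
)
```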

Hope this helps.

AWS
EXPERT
answered 2 years ago

Hi. Without knowing all the details, it's difficult to come up with the most optimal solution. The simplest option would be this: as soon as a file is uploaded to S3, use an S3 event and a Lambda function to update a counter in DynamoDB, and as soon as the counter reaches 20, trigger the ETL job. Alternatively, you can build a Step Functions workflow to implement similar logic. As an example, here's an AWS blog post which may be of help: https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
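A minimal sketch of that counter approach, assuming hypothetical table, key, and job names (in practice you would also reset the counter between batches):

```python
import boto3

dynamodb = boto3.client("dynamodb")
glue = boto3.client("glue")

EXPECTED_FILES = 20  # total number of files, per the question

def handler(event, context):
    """Invoked by an S3 event notification for each uploaded file."""
    # Atomically increment the counter; table and key names are placeholders.
    resp = dynamodb.update_item(
        TableName="file-arrival-counter",
        Key={"pk": {"S": "daily-batch"}},
        UpdateExpression="ADD file_count :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    count = int(resp["Attributes"]["file_count"]["N"])

    if count == EXPECTED_FILES:
        # All files have arrived; start the ETL job (placeholder name).
        glue.start_job_run(JobName="csv-to-mysql-job")
```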

Or, if you know the order of the files, you can have an S3 event trigger a Lambda function on the last file and process all of the files then. And if you don't need to process the files as soon as they land, you can trigger the ETL job at a time by which you know the files will have landed in S3.
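That last option is just a scheduled Glue trigger; in a sketch like the following, the schedule and job name are assumptions:

```python
import boto3

glue = boto3.client("glue")

# Start the job every day at 06:00 UTC, by which time all files are
# assumed to have landed. The schedule and job name are placeholders.
glue.create_trigger(
    Name="daily-csv-load",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "csv-to-mysql-job"}],
    StartOnCreation=True,
)
```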

AWS
answered 2 years ago

As part of the Lambda function, you can retrieve the object list and count it; if the count is 20, trigger the Glue job. Or, if you have a naming convention in the file names and can identify from the name itself that the last file has been received, trigger the Glue job then.
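Such a Lambda could be invoked by the S3 event notification itself; here's a small sketch, with placeholder bucket, prefix, and job names:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

EXPECTED_FILES = 20

def handler(event, context):
    """Invoked on each S3 upload; starts the job once all files exist."""
    # Bucket and prefix are placeholders for the drop location.
    resp = s3.list_objects_v2(Bucket="my-csv-bucket", Prefix="incoming/")
    if resp.get("KeyCount", 0) >= EXPECTED_FILES:
        glue.start_job_run(JobName="csv-to-mysql-job")
```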

AWS
Zahid
answered 2 years ago
  • Thank you, when will this Lambda be triggered? On a schedule, or each time a new object is uploaded?


I have a question on the above scenario/use case. Let's say I have scheduled the Glue job for every 10 S3 events, but for some reason my customers uploaded only 8 files on a particular day. My job will not run that day because it did not reach the counter value of 10. In my case I can also get only one file on a specific day, or more than 10 files on a specific day. Any idea on how to handle this scenario?

answered 2 years ago
