Synchronous Glue Job in Step Function is slow to recognize completion of Glue Job


I am using a Step Function to execute a Glue Job. The Step Function is set to run in synchronous mode; however, there is usually a 2-4 minute lag between Glue Job completion and the point at which the Step Function considers the Glue Job complete and moves to the next step. For example, the Glue Job's last run took 15 minutes, but the Step Function spent 19 minutes on this step. Has anyone else experienced this? Is my only option to execute in async mode and poll more often for completion?

asked 2 months ago, 62 views
1 Answer
Accepted Answer

You are experiencing this delay because Glue does not support CloudWatch events for notifying Step Functions of the latest job status. The same is true of EMR. Instead, Step Functions has to poll the status by making Describe* API calls to EMR/Glue. Currently, the default polling schedule is every 1 minute for the first 10 minutes, then every 5 minutes thereafter. Therefore, if the job takes more than 10 minutes to complete its execution, you can expect an average delay of 2.5 minutes, with 5 minutes being the worst case. The Step Functions team knows about this issue and is trying to implement a solution.
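To make the arithmetic concrete, here is a minimal sketch that simulates the polling schedule described above. It assumes polls are counted from the moment the job starts and that completion is noticed at the first poll at or after the job finishes; the real alignment of polls may differ, and the function names are purely illustrative.

```python
def poll_times(fast_interval=1.0, fast_window=10.0, slow_interval=5.0):
    """Yield poll times (minutes since job start) for the assumed schedule:
    every `fast_interval` minutes during the first `fast_window` minutes,
    then every `slow_interval` minutes."""
    t = 0.0
    while True:
        t += fast_interval if t < fast_window else slow_interval
        yield t


def detection_delay(job_minutes):
    """Minutes between job completion and the first poll that can see it."""
    for t in poll_times():
        if t >= job_minutes:
            return t - job_minutes


if __name__ == "__main__":
    for duration in (7.5, 12.0, 16.0, 19.9):
        print(duration, "min job -> delay of", detection_delay(duration), "min")
```

Under these assumptions, a job that finishes within the first 10 minutes is picked up in under a minute, while a longer job waits for the next 5-minute poll, which averages out to about 2.5 minutes.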

The workaround that you can implement on your end is to use a Lambda function that makes Describe API calls to check the EMR/Glue job status more often than Step Functions does.
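A minimal sketch of such a Lambda function for the Glue case might look like the following. It uses the boto3 Glue client's `get_job_run` call; the event keys (`JobName`, `JobRunId`), the returned fields, and the helper name `check_glue_job_run` are assumptions made for illustration, not part of any Step Functions contract.

```python
# Glue job run states after which no further polling is needed.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def check_glue_job_run(glue_client, job_name, run_id):
    """Look up one Glue job run and report whether it has finished."""
    response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
    state = response["JobRun"]["JobRunState"]
    return {
        "JobName": job_name,
        "JobRunId": run_id,
        "State": state,
        "Done": state in TERMINAL_STATES,
    }


def lambda_handler(event, context):
    """Lambda entry point; expects JobName and JobRunId in the event (assumed shape)."""
    import boto3  # imported here so the helper above can be tested without AWS

    glue = boto3.client("glue")
    return check_glue_job_run(glue, event["JobName"], event["JobRunId"])
```

One way to wire this in, assuming the state machine starts the job asynchronously and passes the run ID along, is a loop of a Wait state (for example 30 seconds), a Task state invoking this Lambda, and a Choice state that exits when `Done` is true and branches on `State`.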

If you require in-depth assistance with this issue, I would advise you to raise a support case with the Step Functions Technical Support team.

answered 2 months ago
  • Thank you for responding. We'll implement a workaround for now.
