Glue Start-Blueprint-Run running into timeout issues with increased number of jobs

0

Describe the bug Using AWS-CLI (Version: aws-cli/2.8.6 Python/3.9.11 Windows/10 exe/AMD64 prompt/off), starting a Glue Blueprint Run works fine when the number of objects generated inside the workflow (Triggers/Glue Jobs) is under 30-40 objects in total. But when there's more objects being generated inside the glue workflow, the Blueprint Run seems to be timing out and gets stuck in RUNNING state.

Expected Behavior We expect the Blueprint Run to spin up the workflow with all the number of jobs as needed. This is a sample example of the workflow where each row consists of 2 jobs with a trigger in the middle for each task, and there could be N number of tasks like this:

AWS Glue workflow from Blueprint run

Current Behavior These number of rows of tasks when under 8-10 tasks, the blueprint run is successful and doesn't time out but when it's more the Blueprint Run is stuck in RUNNING state and we never get the workflow generated.

Reproduction Steps This is the AWS CLI command we're using right now: aws glue start-blueprint-run --blueprint-name BLUEPRINT_NAME --role-arn IAMRoleARN --parameters "file://FILE_PATH.json" --region us-east-1 --profile test-naga --cli-connect-timeout 900 --cli-read-timeout 900

The JSON object takes in a collection of table names and loops over them and the layout file creates the workflow as shown in the image above. It's not easy to reproduce this with the exact same example but maybe one of the samples here https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples/crawl_s3_locations could be used but for a higher number of jobs/objects created through the Blueprint Run

Possible Solution I'm wondering if it's related to the --cli-connect-timeout and --cli-read-timeout issue as the default value is 60 seconds and it seems like the Blueprint Run tries to spin up all the resources in that time but if there are more objects to spin up and it crosses this time, the whole process times out and gets stuck in the RUNNING state without doing anything.

We also tried setting these values to 0 and still the same issue. The number of objects it spins up when timing out seems to be random across each runs.

CLI version used 2.8.6

Environment details (OS name and version, etc.) Windows 10

asked a year ago53 views
No Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions