Glue Python job runs multiple times with different arguments

Hi team,

My requirement is this: say I have 10 different Glue PySpark jobs (job1, job2, ..., job10). Once job1 runs and succeeds, it should start a Glue Python shell job (say the script is named glue_common_python_shell.py). The Python shell job is a single script that receives different arguments each time.

In other words:

glue pyspark job1 completes --> trigger --> glue_common_python_shell.py (with arguments --table_name1)
glue pyspark job2 completes --> trigger --> glue_common_python_shell.py (with arguments --table_name2)
...
glue pyspark job10 completes --> trigger --> glue_common_python_shell.py (with arguments --table_name10)

How can I achieve this orchestration in AWS? Please help.

Thanks in advance.

Asked 2 years ago, 2487 views
1 Answer
Accepted Answer

There are multiple ways to orchestrate the Glue jobs. Here are some example architectures that would work in this scenario:

Use Step Functions to run the Glue jobs in sequence. In the example below, a Lambda function returns the table names and any other inputs the Glue jobs need, and each job state reads its value from that output. The startJobRun.sync integration makes the state machine wait for the job run to finish, so the next state starts only after the job has completed successfully; if the job fails, the state fails.

{
  "Comment": "Run Glue job workflow",
  "StartAt": "Lambda Invoke",
  "States": {
    "Lambda Invoke": {
      "Type": "Task",
      "Comment": "FunctionName is a placeholder (not in the original answer); replace it with your function's name or ARN.",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "FunctionName": "REPLACE_WITH_YOUR_FUNCTION_NAME",
        "Payload.$": "$"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "run_glue_job_1"
    },
    "run_glue_job_1": {
      "Type": "Task",
      "Comment": "Passes the table name as a job argument. ResultPath null discards the job result so the Lambda output stays available to the next state.",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "glue_job_1",
        "Arguments": {
          "--table_name.$": "$.table_name_1"
        }
      },
      "ResultPath": null,
      "Next": "run_glue_job_2"
    },
    "run_glue_job_2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "glue_job_2",
        "Arguments": {
          "--table_name.$": "$.table_name_2"
        }
      },
      "ResultPath": null,
      "Next": "run_glue_job_3"
    },
    "run_glue_job_3": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "glue_job_3",
        "Arguments": {
          "--table_name.$": "$.table_name_3"
        }
      },
      "End": true
    }
  }
}
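
The job states above read keys such as $.table_name_1 from the Lambda output, so the function needs to return an object with those keys. A minimal sketch of such a handler follows; the table names and the TABLE_NAMES list are hypothetical placeholders, and in practice the values might come from a config file or a database lookup instead.

```python
# Hypothetical Lambda handler for the "Lambda Invoke" state.
# TABLE_NAMES is an illustrative placeholder, not part of the original answer.
TABLE_NAMES = ["table_1", "table_2", "table_3"]


def lambda_handler(event, context):
    """Return one key per Glue job state: table_name_1, table_name_2, ..."""
    return {
        f"table_name_{index}": name
        for index, name in enumerate(TABLE_NAMES, start=1)
    }
```

Because the state sets OutputPath to $.Payload, the dictionary returned here becomes the input of the first Glue job state.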

At the time of this answer, a Glue workflow cannot run the same job multiple times with different parameters. You could instead create multiple workflows, each calling the same job with different parameters, and trigger each one when the previous workflow completes.

You could use Lambda functions to start a job with input parameters, wait a few minutes, check whether the job has completed, and then start the next job. The problem with this approach is that you pay for the Lambda function's execution time while it waits for the Glue job, and a single invocation can run for at most 15 minutes.
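
A rough sketch of that polling approach is below. It assumes `glue` is a boto3 Glue client (for example `boto3.client("glue")`) and uses its `start_job_run` and `get_job_run` calls; the function name `run_job_and_wait`, the default intervals, and the job name in the usage note are illustrative assumptions, not part of the original answer.

```python
import time

# Job run states after which Glue will not change the run any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, arguments,
                     poll_seconds=30, max_wait_seconds=840, sleep=time.sleep):
    """Start a Glue job run and poll until it reaches a terminal state.

    `glue` is expected to be a boto3 Glue client. The default
    max_wait_seconds (an assumption) stays under Lambda's 15-minute limit.
    """
    run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]
    waited = 0
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        if waited >= max_wait_seconds:
            raise TimeoutError(f"{job_name} run {run_id} still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

A caller might start the next job only when the returned state is "SUCCEEDED", for example after `run_job_and_wait(glue, "glue_common_python_shell", {"--table_name": "table_1"})`, where the job name is hypothetical. The function is billed for every second it spends sleeping, which is the drawback noted above.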

Another option is to orchestrate the jobs with Apache Airflow (Amazon MWAA).

Answered 2 years ago by AWS
Reviewed 2 years ago by an AWS Expert
