Issue with maximum concurrent runs and job status

1

I have a very simple Glue ETL job configured with a maximum of 1 concurrent run. The job works fine when run manually from the AWS console and CLI.

I have some Python code that is designed to run this job periodically against a queue of work that results in different arguments being passed to the job. The Python code starts the job and waits for it to enter the SUCCEEDED state but will abort if it stops, fails, times out, etc.

The relevant snippet is:

        # Start the run; Arguments carries the per-run parameters from the work queue.
        start_response = self.client.start_job_run(JobName=self.jobname, Arguments=formatted_arguments)

        if not wait:
            return

        jobid = start_response['JobRunId']
        log.info("Waiting for glue job %s (%s)", self.jobname, jobid)
        # Poll the run state until it succeeds or reaches a failure state.
        while True:
            state = self.job_status(jobid)
            if state == 'SUCCEEDED':
                log.info("Glue job %s (%s) completed", self.jobname, jobid)
                return
            # Abort on any state that means the run did not (or will not) succeed.
            if state in ['STOPPED', 'FAILED', 'TIMEOUT', 'STOPPING']:
                raise RuntimeError("Glue job %s (%s) %s" % (self.jobname, jobid, state))
            # Anything other than STARTING or RUNNING is unexpected here.
            if state not in ['STARTING', 'RUNNING']:
                raise RuntimeError("Glue job %s (%s) is in unknown state %s" % (self.jobname, jobid, state))

            log.debug("Waiting for %s (%s), which is %s", self.jobname, jobid, state)
            time.sleep(GLUE_STATUS_INTERVAL)

Unfortunately, seemingly without fail, if I start the same job again as soon as it enters the SUCCEEDED state, Glue claims I've hit the maximum concurrent runs (1) for the job in question:

ConcurrentRunsExceededException: An error occurred (ConcurrentRunsExceededException) when calling the StartJobRun operation: Concurrent runs exceeded for <job>

When I look at the console, the job is SUCCEEDED and there are no others running, for this job or otherwise.

I can work around this by sleeping in the right place, but that feels like a workaround for what smells like a bug somewhere. I noticed that the last log entry for an example run was at 9:54:15, after it claimed the job was SUCCEEDED, but the console says the end time was 9:55 (with no seconds).

Any ideas why I can't start another run immediately after the previous one completes? Is there some sort of cool-down period?

Edited by: jhart-r7 on Jun 20, 2018 10:18 AM

Asked 6 years ago · 3966 views
2 Answers
0

I've been able to reliably work around this by sleeping a minute between consecutive runs of this job. It really feels like either there is another state after SUCCEEDED that isn't described in https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-runs.html, or there is an undocumented cool-down period between runs.
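As an alternative to a fixed one-minute sleep, here is a minimal sketch that retries the start call only when Glue actually reports the concurrency limit. It assumes `client` is a boto3 Glue client (for example `boto3.client("glue")`); the function name and the `max_attempts` and `delay` defaults are hypothetical choices for illustration, not values taken from AWS documentation.

```python
import time


def start_job_run_with_retry(client, job_name, arguments,
                             max_attempts=6, delay=10.0, sleep=time.sleep):
    """Start a Glue job run, retrying while Glue reports the concurrency limit.

    `client` is assumed to be a boto3 Glue client. The function name,
    `max_attempts`, and `delay` are illustrative, not AWS-recommended values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = client.start_job_run(JobName=job_name, Arguments=arguments)
            return response["JobRunId"]
        except client.exceptions.ConcurrentRunsExceededException:
            # Glue still counts the previous run; give up after the last attempt.
            if attempt == max_attempts:
                raise
            sleep(delay)
```

This waits only as long as Glue keeps rejecting the call, so a run that is released quickly starts sooner than it would after a blanket sleep. Any other error from `start_job_run` is raised immediately.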

Has anyone else experienced this? Ideas to solve this?

Answered 6 years ago
0

I faced a similar issue recently. In my case I have to run the same job 100 times with different parameter values. Is there a solution for this? I am driving the runs from a Lambda function, which has a maximum timeout of 15 minutes, and I could not even finish 3 runs. Please suggest an alternative if there is one.
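For running one job many times with different parameters, a rough sketch of a sequential driver follows. It assumes `client` is a boto3 Glue client and that the driver runs somewhere a long-lived loop is acceptable, which a single Lambda invocation with a 15-minute limit may not be. The function name, `poll_interval`, and `max_start_attempts` are hypothetical, and the failure states mirror the ones checked in the question's snippet.

```python
import time

# States treated as failures, copied from the snippet in the question.
FAILURE_STATES = {"STOPPED", "FAILED", "TIMEOUT", "STOPPING"}


def run_sequentially(client, job_name, argument_sets, poll_interval=30.0,
                     max_start_attempts=10, sleep=time.sleep):
    """Run a Glue job once per argument set, one run at a time.

    Returns the list of JobRunIds in the order the runs completed.
    """
    run_ids = []
    for arguments in argument_sets:
        # Start the run, retrying while Glue still counts the previous one.
        run_id = None
        for _ in range(max_start_attempts):
            try:
                response = client.start_job_run(JobName=job_name, Arguments=arguments)
                run_id = response["JobRunId"]
                break
            except client.exceptions.ConcurrentRunsExceededException:
                sleep(poll_interval)
        if run_id is None:
            raise RuntimeError("Could not start %s: concurrency limit still reported" % job_name)

        # Poll until this run succeeds or reaches a failure state.
        while True:
            job_run = client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
            state = job_run["JobRunState"]
            if state == "SUCCEEDED":
                break
            if state in FAILURE_STATES:
                raise RuntimeError("Glue job %s (%s) %s" % (job_name, run_id, state))
            sleep(poll_interval)
        run_ids.append(run_id)
    return run_ids
```

A call might look like `run_sequentially(glue, "my-job", [{"--day": "1"}, {"--day": "2"}])`, where the job name and argument keys are placeholders.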

Answered 3 years ago