Two simultaneous Batch jobs keep spawning multiple instances.


Hello,

Intro

I created an AWS Batch job queue, job definition and compute environment for GPU jobs using the AWS Batch Wizard. When I submit a single job, an instance is spun up, the job runs, and after it finishes the instance is shut down. I can then submit a new job.

Problem

When I submit a second job while the first one is running, the second job gets stuck in the "Runnable" state. I can see that multiple new instances are being spawned, but they all stay in the "Initializing" state. To make things even weirder, when I then manually terminate both jobs, all instances shut down. However, when I then submit a single new job, it also remains in the "Runnable" state and spawns many new instances. This does not appear to fix itself unless I run the wizard again to create a new queue/definition/environment.

Additional Info:

The compute environment uses a customized launch template that mounts a 100 GB volume for the container specified by the job definition. The job definition requests 8 vCPUs, and the compute environment has minvCpus = desiredvCpus = 0 and maxvCpus = 128. The number of desired vCPUs shown in the compute environment dashboard keeps increasing until it reaches the limit of 128. The "extra" instances spawned in addition to the single, properly running one all remain in the "Initializing" state. They keep running until the vCPU limit is reached, at which point running instances start being terminated to "make room for new ones". The EC2 Auto Scaling group that corresponds to the compute environment constantly shows "Updating Capacity...".
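For anyone who wants to watch the same numbers outside the dashboard, here is a minimal sketch that pulls the scaling fields out of a `describe_compute_environments` response. The helper name `summarize_scaling`, the `atLimit` flag, and the sample response are my own illustration (the environment name and status are hypothetical); only the vCPU values mirror the ones described above.

```python
from typing import Any


def summarize_scaling(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Extract the vCPU scaling fields from a describe_compute_environments response."""
    summaries = []
    for env in response.get("computeEnvironments", []):
        resources = env.get("computeResources", {})
        desired = resources.get("desiredvCpus", 0)
        maximum = resources.get("maxvCpus", 0)
        summaries.append({
            "name": env.get("computeEnvironmentName"),
            "status": env.get("status"),
            "statusReason": env.get("statusReason"),
            "minvCpus": resources.get("minvCpus", 0),
            "desiredvCpus": desired,
            "maxvCpus": maximum,
            # True once the desired vCPU count has climbed to the configured maximum.
            "atLimit": maximum > 0 and desired >= maximum,
        })
    return summaries


# Hypothetical, hand-written response fragment (not real output).
# A real response would come from something like:
#   boto3.client("batch").describe_compute_environments(computeEnvironments=["<name>"])
sample_response = {
    "computeEnvironments": [
        {
            "computeEnvironmentName": "example-gpu-env",  # hypothetical name
            "status": "VALID",                            # hypothetical status
            "computeResources": {
                "minvCpus": 0,
                "desiredvCpus": 128,
                "maxvCpus": 128,
            },
        }
    ]
}

for summary in summarize_scaling(sample_response):
    print(summary)
```

Polling this while a job sits in "Runnable" would show whether desiredvCpus keeps climbing toward maxvCpus, which is the behavior I am seeing.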

Any help with solving this issue would be greatly appreciated.

posted 23 days ago, 587 views
1 Answer

The issue got fixed by running the wizard again to recreate all resources.

I cannot say exactly why the previous resources were broken. After some experimenting, I found that I can reproduce the issue on a previously working set of resources by editing the compute environment once with a seemingly trivial change (e.g. adding an additional instance type). When comparing the JSON before and after the change, the only difference (besides the added instance type) is that the update policy settings, which were missing before, have been added:

"updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },

However, these are the default values, so I do not understand why adding them would cause job submissions to break.
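In case it helps others compare their own environments, this is roughly how the before/after JSON can be diffed. It is only a sketch: `diff_config` is a helper I wrote for illustration, and the two dictionaries are trimmed, hypothetical stand-ins for the real descriptions (the instance types in particular are assumptions, not the ones from my setup).

```python
from typing import Any


def diff_config(before: dict[str, Any], after: dict[str, Any], prefix: str = "") -> list[str]:
    """Return human-readable differences between two nested dicts, ordered by key."""
    changes = []
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in before:
            changes.append(f"added {path} = {after[key]!r}")
        elif key not in after:
            changes.append(f"removed {path}")
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Recurse into nested sections such as computeResources.
            changes.extend(diff_config(before[key], after[key], path))
        elif before[key] != after[key]:
            changes.append(f"changed {path}: {before[key]!r} -> {after[key]!r}")
    return changes


# Hypothetical, trimmed descriptions of the compute environment before and after the edit.
before = {
    "computeResources": {"instanceTypes": ["p3.2xlarge"], "maxvCpus": 128},
}
after = {
    "computeResources": {"instanceTypes": ["p3.2xlarge", "p3.8xlarge"], "maxvCpus": 128},
    "updatePolicy": {"terminateJobsOnUpdate": False, "jobExecutionTimeoutMinutes": 30},
}

for change in diff_config(before, after):
    print(change)
```

With the real JSON files loaded via `json.load`, this should report the changed instance types and the newly added updatePolicy section, matching what I found by eye.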

answered 20 days ago
