AWS Batch GPU startup latency


Hi

So I've been struggling to reduce my AWS Batch startup time: it takes 6-7 minutes to start the job, while the job itself takes only a few seconds. Is this normal/expected? (I don't think it's a provisioning problem, because right after the job is submitted I can see the instance initializing in the EC2 console.)
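
For context, here's roughly how I'm measuring the delay with boto3 (a quick sketch; the job ID is a placeholder):

    import boto3

    batch = boto3.client("batch")

    # Placeholder job ID -- substitute the ID returned by submit_job or shown in the console.
    job = batch.describe_jobs(jobs=["example-job-id"])["jobs"][0]

    # Batch reports these timestamps as epoch milliseconds.
    created = job["createdAt"]
    started = job.get("startedAt")
    stopped = job.get("stoppedAt")

    if started:
        print(f"queued -> running: {(started - created) / 1000:.0f}s")
    if started and stopped:
        print(f"running -> done:   {(stopped - started) / 1000:.0f}s")

This just makes it easier to see how much of the 6-7 minutes is spent before the container actually starts running.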

Resources I'm using:

  • Image-Type: "ECS_AL2_NVIDIA"
  • ECR: Docker image is 4GB
  • EFS: ML model is saved here and mounted on the instance (rough job-definition sketch after this list)
  • Instance: G4dn
  • Environment: Spot
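
For reference, the GPU requirement and the EFS mount look roughly like this in my job definition (simplified sketch; the name, image URI, filesystem ID, and paths are placeholders):

    import boto3

    batch = boto3.client("batch")

    # Simplified sketch -- name, image URI, filesystem ID, and paths are placeholders.
    batch.register_job_definition(
        jobDefinitionName="gpu-inference-example",
        type="container",
        containerProperties={
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",
            "resourceRequirements": [
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "16384"},
                {"type": "GPU", "value": "1"},
            ],
            # EFS volume holding the ML model, mounted read-only into the container.
            "volumes": [
                {
                    "name": "model-store",
                    "efsVolumeConfiguration": {"fileSystemId": "fs-0123456789abcdef0"},
                }
            ],
            "mountPoints": [
                {
                    "sourceVolume": "model-store",
                    "containerPath": "/mnt/model",
                    "readOnly": True,
                }
            ],
        },
    )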

Things I've tried:

  1. Building my own AMI and reducing the volume size (attached to the compute environment as sketched after this list), but the time stays the same
  2. Saving the model on the AMI, still no improvement
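
For completeness, the custom AMI is attached to the compute environment roughly like this (sketch only; names, ARNs, subnet/security-group IDs, and the AMI ID are placeholders):

    import boto3

    batch = boto3.client("batch")

    # Sketch only -- names, ARNs, subnet/security-group IDs, and the AMI ID are placeholders.
    batch.create_compute_environment(
        computeEnvironmentName="gpu-spot-example",
        type="MANAGED",
        computeResources={
            "type": "SPOT",
            "minvCpus": 0,
            "maxvCpus": 16,
            "instanceTypes": ["g4dn.xlarge"],
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroupIds": ["sg-0123456789abcdef0"],
            "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
            # Custom AMI based on the NVIDIA ECS-optimized image.
            "ec2Configuration": [
                {"imageType": "ECS_AL2_NVIDIA", "imageIdOverride": "ami-0123456789abcdef0"}
            ],
        },
        serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
    )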

When checking the EC2 instance, it seems to be initialized and ready to go within 3 minutes, so I don't understand what takes so long. Is it the download of the Docker image? *edit: I've made my Docker image 10 GB and this doesn't affect the startup time at all...
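
To narrow down where the remaining minutes go, I compare the instance launch time, the ECS registration time, and the job start time, roughly like this (sketch; the job ID and cluster name are placeholders, and I'm assuming the job already has at least one attempt recorded):

    import boto3
    from datetime import datetime, timezone

    batch = boto3.client("batch")
    ecs = boto3.client("ecs")
    ec2 = boto3.client("ec2")

    # Placeholder job ID -- use a running or finished job so an attempt exists.
    job = batch.describe_jobs(jobs=["example-job-id"])["jobs"][0]
    attempt = job["attempts"][-1]
    container_instance_arn = attempt["container"]["containerInstanceArn"]

    # Placeholder cluster name -- Batch creates its own ECS cluster per compute environment.
    ci = ecs.describe_container_instances(
        cluster="example-batch-cluster", containerInstances=[container_instance_arn]
    )["containerInstances"][0]

    reservation = ec2.describe_instances(InstanceIds=[ci["ec2InstanceId"]])
    launch_time = reservation["Reservations"][0]["Instances"][0]["LaunchTime"]

    job_started = datetime.fromtimestamp(job["startedAt"] / 1000, tz=timezone.utc)
    print(f"EC2 launched at:   {launch_time}")
    print(f"ECS registered at: {ci['registeredAt']}")
    print(f"Job started at:    {job_started}")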

How long does it usually take for a GPU batch job to start? Any suggestions would be appreciated.

1 Answer

GPU-based instances can take longer to launch than non-GPU instances because the underlying hardware and drivers take longer to become available. The instance will only show as available once everything on the base instance is running successfully.
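
One way to verify this is to time how long it takes for the NVIDIA driver to respond after boot, for example from an init script or user data on the instance (a minimal sketch, assuming nvidia-smi is already installed on the AMI, as it is on the ECS GPU-optimized image):

    import subprocess
    import time

    # Minimal sketch: poll nvidia-smi until the NVIDIA driver responds,
    # then report how long that took after this script started.
    start = time.monotonic()
    while subprocess.run(["nvidia-smi"], capture_output=True).returncode != 0:
        time.sleep(5)
    print(f"GPU driver became available after {time.monotonic() - start:.0f}s")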

AWS
answered a year ago
