Hi
So I've been struggling to reduce my AWS Batch startup time. It takes 6-7 minutes to start the job, while the job itself takes only a few seconds. Is this normal/expected? (I don't think it's a provisioning problem, because after the job is submitted, the EC2 instance already shows as initializing in the console.)
Resources I'm using:
- Image-Type: "ECS_AL2_NVIDIA"
- ECR: Docker image is 4GB
- EFS: ML Model is saved here and mounted on the instance
- Instance: G4dn
- Environment: Spot
Things I've tried:
- Building my own AMI and reducing the volume size, but the time stays the same
- Saving the model on the AMI, still no improvement
When checking the EC2 instance, it seems to be initialized and ready to go within 3 minutes, so I don't understand what takes so long. Is it the download of the Docker image? *edit: I've made my Docker image 10 GB and this doesn't affect startup time at all...
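One way to pin down where the time goes is to compare the job's own timestamps. A minimal sketch, assuming a job description like the one returned by `aws batch describe-jobs` (or boto3's `describe_jobs`), which reports `createdAt`, `startedAt`, and `stoppedAt` as epoch milliseconds; the timestamp values below are made up for illustration:

```python
# Break down AWS Batch job latency from describe-jobs timestamps.
# createdAt  -> job submitted to the queue
# startedAt  -> container actually started running
# stoppedAt  -> job finished
# A large createdAt->startedAt gap means time is spent on scheduling,
# instance provisioning, or image pull, not on the job itself.

def latency_breakdown(job):
    """Return (startup seconds, run seconds) for one job description."""
    startup_s = (job["startedAt"] - job["createdAt"]) / 1000
    run_s = (job["stoppedAt"] - job["startedAt"]) / 1000
    return startup_s, run_s

# Hypothetical job with a 6.5-minute startup and an 8-second run:
job = {
    "createdAt": 1_700_000_000_000,
    "startedAt": 1_700_000_390_000,
    "stoppedAt": 1_700_000_398_000,
}
startup, run = latency_breakdown(job)
print(f"startup: {startup:.0f}s, run: {run:.0f}s")  # → startup: 390s, run: 8s
```

If the startup number dominates, cross-checking it against the EC2 instance launch time and the ECS agent logs on the instance should show whether the gap is scheduling, provisioning, or image pull.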
How long does it usually take for a GPU batch job to start?
Any suggestions would be appreciated.