AWS Batch GPU startup latency


Hi

So I've been struggling to reduce my AWS Batch startup time: it takes 6-7 minutes to start the job, while the job itself takes only a few seconds. Is this normal/expected? (I don't think it's a provisioning problem, because after the job is posted I can see the EC2 instance initializing.)

Resources I'm using:

  • Image-Type: "ECS_AL2_NVIDIA"
  • ECR: Docker image is 4GB
  • EFS: ML Model is saved here and mounted on the instance
  • Instance: G4dn
  • Environment: Spot

Things I've tried:

  1. Building my own AMI and reducing the volume size, but the time stays the same
  2. Baking the model into the AMI, still no improvement

When checking the EC2 instance, it seems to be initialized and ready to go within 3 minutes, so I don't understand what takes so long. Is it the download of the Docker image? *edit: I've grown my Docker image to 10 GB and this doesn't affect startup time at all...

How long does it usually take for a GPU batch job to start? Any suggestions would be appreciated.
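One way to pin down where the 6-7 minutes actually go is to compare the timestamps AWS Batch records on the job (`createdAt`, `startedAt`, `stoppedAt`, all in milliseconds since the epoch, returned by `describe-jobs`). A minimal sketch; the job dict below is illustrative sample data, and in practice you would fetch the real one with boto3:

```python
# Sketch: break down AWS Batch job latency from describe-jobs timestamps.
# With real credentials you would fetch the job like this:
#   job = boto3.client("batch").describe_jobs(jobs=[job_id])["jobs"][0]
# The dict below is illustrative sample data, not a real job.
job = {
    "createdAt": 1_700_000_000_000,  # job submitted (ms since epoch)
    "startedAt": 1_700_000_390_000,  # container entered RUNNING (~6.5 min later)
    "stoppedAt": 1_700_000_395_000,  # container exited (~5 s of real work)
}

def latency_breakdown(job):
    """Return (startup seconds, run seconds) for a Batch job.

    Startup covers queueing, instance provisioning, driver init,
    and the image pull; run time is the container's own execution.
    """
    startup = (job["startedAt"] - job["createdAt"]) / 1000
    runtime = (job["stoppedAt"] - job["startedAt"]) / 1000
    return startup, runtime

startup, runtime = latency_breakdown(job)
print(f"startup: {startup:.0f}s, runtime: {runtime:.0f}s")
```

If startup dominates even though the instance is ready in ~3 minutes, the remaining gap sits between instance readiness and the container starting (agent registration, image pull, GPU driver availability), which narrows down what to investigate.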

1 Answer

GPU-based instances can take longer to launch than non-GPU instances because the underlying hardware and drivers take longer to become available. The instance will only show as available once everything on the base instance is running successfully.

AWS
Barry M
answered 8 months ago
