Machine Learning image training on EC2 with GPU


I am training deep learning models with ten thousand images on a G4 GPU instance, using local storage. I use parallel PyTorch dataloaders, just as I do with on-prem GPU hardware. On-prem, GPU utilization typically holds at a constant 99% during training and varies during validation steps. On EC2, utilization during training swings between 30% and at most 70%, then drops back to zero, for an average of roughly 30-40%. How can I get higher GPU utilization in this scenario?

  • Just to be clear, by "local storage," do you mean EC2 instance storage, or do you mean the root EBS volume for your instance? The two have very different performance characteristics.

Asked 2 years ago, 314 views
1 Answer

Hello,

Thank you for posting your question! You may want to follow the steps below to optimize the GPU settings and get the best performance from the GPU: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

The page linked above describes how to set the GPU clock speed to its maximum frequency, which depends on the instance type.
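As a rough illustration of what those steps can look like, here is a minimal Python sketch that assembles the usual `nvidia-smi` calls (persistence mode, disabling auto boost, setting application clocks). The helper names `build_clock_commands` and `apply_clocks` are hypothetical, and the clock pair `5001,1590` is an assumption for a G4dn instance with an NVIDIA GPU; please confirm the values for your instance type against the linked guide or `nvidia-smi -q -d SUPPORTED_CLOCKS` before applying anything.

```python
import subprocess


def build_clock_commands(memory_mhz, graphics_mhz):
    """Return the nvidia-smi commands as argument lists (nothing is executed here)."""
    return [
        # Keep the driver loaded so the settings persist between jobs.
        ["sudo", "nvidia-smi", "-pm", "1"],
        # Turn off auto boost so the clocks stay where we set them.
        # Some GPUs do not support this option; an error here may be ignorable.
        ["sudo", "nvidia-smi", "--auto-boost-default=0"],
        # Set application clocks as "<memory MHz>,<graphics MHz>".
        ["sudo", "nvidia-smi", "-ac", f"{memory_mhz},{graphics_mhz}"],
    ]


def apply_clocks(memory_mhz, graphics_mhz, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in build_clock_commands(memory_mhz, graphics_mhz):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumed values for G4dn; verify them for your instance type first.
    apply_clocks(5001, 1590, dry_run=True)
```

With `dry_run=True` the script only prints the commands, so you can review them first. Switching to `dry_run=False` requires sudo rights and an NVIDIA driver on the instance.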

AWS
Support Engineer
Answered 2 years ago
