Machine Learning image training on EC2 with GPU

Question

I am training deep learning models on ten thousand images on a G4 GPU instance, using local storage. I use parallel PyTorch dataloaders, just as I do on on-prem GPU hardware. On-prem, GPU utilization is typically a constant 99% during training and varies during validation steps. On EC2, utilization during training swings between 30% and perhaps 70%, then drops back to zero, for an average of maybe 30-40%. Please suggest how to get higher GPU utilization in this scenario.

MichaelFischer Expert
2 years ago
Just to be clear, by "local storage," do you mean EC2 instance storage, or do you mean the root EBS volume for your instance? The two have very different performance characteristics.

Answer

Hello,

Thank you for posting your question! You may want to follow the steps in the guide below to optimize the GPU settings and get the best performance from the GPU: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

The guide at the above URL shows how to set the GPU clock speed to its maximum frequency, which depends on the instance type.
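
As a rough illustration of that clock-speed step, here is a minimal Python sketch. It assumes `nvidia-smi` is installed on the instance and that the driver reports the `clocks.max.memory` and `clocks.max.graphics` query fields. The helper names (`parse_max_clocks`, `build_clock_commands`) are made up for this example and are not part of any AWS or NVIDIA tooling. It only prints the commands rather than running them, so you can compare the values against the ones listed for your instance type in the guide before applying anything.

```python
import shutil
import subprocess

# Ask the driver for the maximum supported memory and graphics clocks (MHz).
QUERY = [
    "nvidia-smi",
    "--query-gpu=clocks.max.memory,clocks.max.graphics",
    "--format=csv,noheader,nounits",
]


def parse_max_clocks(line):
    """Parse one CSV line such as '5001, 1590' into (memory_mhz, graphics_mhz)."""
    mem, gfx = (int(part.strip()) for part in line.split(","))
    return mem, gfx


def build_clock_commands(mem_mhz, gfx_mhz):
    """Return the commands that would pin application clocks (hypothetical helper)."""
    return [
        # Enable persistence mode so the settings survive while no process uses the GPU.
        ["sudo", "nvidia-smi", "-pm", "1"],
        # Set application clocks as <memory,graphics>.
        ["sudo", "nvidia-smi", "-ac", f"{mem_mhz},{gfx_mhz}"],
    ]


def main():
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    first_gpu = out.strip().splitlines()[0]  # first GPU only, for simplicity
    for cmd in build_clock_commands(*parse_max_clocks(first_gpu)):
        print(" ".join(cmd))


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        main()
    else:
        print("nvidia-smi not found; run this on the GPU instance.")
```

Note that pinning the clocks only helps if the GPU itself is the bottleneck. Utilization that keeps dropping to zero can also mean the GPU is waiting on input data, so it is worth checking storage and dataloader throughput as well.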
