Machine Learning image training on EC2 with GPU

I am training deep learning models with ten thousand images on a G4 GPU instance, reading the data from local storage. I use parallel PyTorch dataloaders, just as I do on on-prem GPU hardware. On-prem, GPU utilization is typically a constant 99% during training and varies during validation steps. On EC2, utilization during training swings between 30% and perhaps 70%, then drops back to zero, for an average of roughly 30-40%. Please suggest how to get higher GPU utilization in this scenario.

  • Just to be clear, by "local storage," do you mean EC2 instance storage, or do you mean the root EBS volume for your instance? The two have very different performance characteristics.
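One way to tell whether storage or the input pipeline is what starves the GPU is to time how long each training step waits for the next batch versus how long it spends computing. The sketch below is only an illustration under assumptions: `profile_input_wait` and `slow_batches` are hypothetical helper names, not part of PyTorch, and the fake loader merely simulates slow reads. With a real PyTorch `DataLoader` you would pass the loader as `batches`, and the `step` callable would need to end with `torch.cuda.synchronize()` so that asynchronous GPU work is included in the compute time.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def profile_input_wait(
    batches: Iterable, step: Callable[[object], None]
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent inside step)."""
    wait_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader produces a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # forward/backward/optimizer step would go here
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    return wait_s, step_s


def slow_batches(n: int, delay_s: float) -> Iterator[int]:
    """Hypothetical stand-in for a loader that is limited by slow reads."""
    for i in range(n):
        time.sleep(delay_s)  # simulated disk/decoding latency
        yield i


if __name__ == "__main__":
    wait_s, step_s = profile_input_wait(
        slow_batches(20, 0.02), lambda batch: time.sleep(0.005)
    )
    total = wait_s + step_s
    print(f"waiting on data: {wait_s / total:.0%}, computing: {step_s / total:.0%}")
```

If most of the time is spent waiting on data, the storage type or the loader settings (for example the number of workers) are more likely to be the limit than the GPU itself.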

asked 2 years ago · 314 views
1 Answer
Hello,

Thank you for posting your question! You may want to follow the steps in the guide below to optimize the GPU settings and get the best performance from the GPU: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

The guide describes how to set the GPU clock speed to its maximum frequency; the exact values depend on the instance type.
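As a rough sketch of what that step looks like, the snippet below builds the `nvidia-smi` commands for listing the supported clocks and for pinning the application clocks. `gpu_clock_commands` is a hypothetical helper written for this illustration, and the clock pair in the example is an assumption that should be checked against the guide or against the output of the `SUPPORTED_CLOCKS` query for your instance type. The commands are only printed, not executed.

```python
import shlex
from typing import List


def gpu_clock_commands(memory_mhz: int, graphics_mhz: int) -> List[List[str]]:
    """Build nvidia-smi commands to inspect and set application clocks."""
    if memory_mhz <= 0 or graphics_mhz <= 0:
        raise ValueError("clock frequencies must be positive MHz values")
    return [
        # List the memory/graphics clock pairs the GPU supports.
        ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"],
        # Set application clocks as <memory,graphics> in MHz.
        ["sudo", "nvidia-smi", "-ac", f"{memory_mhz},{graphics_mhz}"],
    ]


if __name__ == "__main__":
    # Example values only (assumed for a G4dn instance); verify before use.
    for cmd in gpu_clock_commands(5001, 1590):
        print(shlex.join(cmd))
```

Note that fixed clocks raise the ceiling on GPU throughput; they do not by themselves explain utilization that repeatedly drops to zero.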

AWS
SUPPORT ENGINEER
answered 2 years ago
