How to overcome an I/O bottleneck during TensorFlow training on p3 instances


I am training a model on a p3.8xlarge instance and am finding that the GPUs are being starved: training slows down, with GPU utilization sitting at zero for much of the time. It appears to be an I/O bottleneck. I have tried a number of things and have been unable to resolve it.

The dataset is large (3 TB), so it must be streamed from disk. I am using a tf.data pipeline with TFRecords, and I believe that part is set up correctly. I have worked out that, to prevent GPU starvation for this particular task on this instance, I need at least 750 MB/s of read throughput. In various tests I have only ever measured transfer speeds of between 200 and 250 MB/s, both during training and when benchmarking with the hdparm tool.
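
For reference, here is a simplified sketch of the kind of input pipeline I am using (the file paths, feature spec, and batch size below are placeholders, not my exact code):

```python
import tensorflow as tf

# Placeholder glob for the TFRecord shards stored on the attached EBS volume.
files = tf.data.Dataset.list_files("/data/tfrecords/train-*.tfrecord", shuffle=True)

def parse_example(serialized):
    # Placeholder feature spec; my real records are larger.
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(serialized, features)

dataset = (
    files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.AUTOTUNE,       # read several shards in parallel
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)              # overlap input with GPU compute
)
```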

I have tried gp3 and io2 volumes and maxed out the provisioned IOPS and throughput settings, but this does not appear to increase the maximum achievable throughput.
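
For what it's worth, this is roughly how I raised the volume limits (a boto3 sketch; the volume ID and region are placeholders, and 16,000 IOPS / 1,000 MiB/s are the gp3 maximums):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Bump a gp3 volume to its maximum provisioned performance.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    VolumeType="gp3",
    Iops=16000,        # gp3 maximum IOPS
    Throughput=1000,   # gp3 maximum throughput in MiB/s
)
```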

Is it possible to achieve these faster I/O speeds with this type of instance? Can anyone demonstrate how it can be done? I have looked at alternative instance types, but the price jumps steeply and availability is poor (e.g. for the p4 instances). The data is on an attached xfs-formatted volume.
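
To cross-check the hdparm numbers, I also measured raw sequential read throughput with a small Python script along these lines (the file path is a placeholder pointing at one of the TFRecord shards on the xfs volume):

```python
import time

def measure_read_throughput(path, block_size=8 * 1024 * 1024):
    """Read a file sequentially and report throughput in MB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    return (total / 1e6) / elapsed

# Placeholder path; the file should be large enough not to be sitting in the
# page cache, otherwise the number is meaningless.
print(f"{measure_read_throughput('/data/tfrecords/train-00000.tfrecord'):.0f} MB/s")
```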

Any help and advice appreciated.

Paul