How to overcome an I/O bottleneck in p3 instance TensorFlow training?


I am training a model on a p3.8xlarge instance and I am finding that the GPUs are being starved: training slows down and GPU utilisation sits at zero for much of the time. It appears to be an I/O bottleneck. I have tried a number of things and have been unable to resolve it.

The dataset is large (3 TB), so it must be streamed from disk. I am using a tf.data pipeline with TFRecords and I believe this aspect is set up correctly. I have worked out that, for this particular task on this instance, I need at least 750 MB/s of I/O throughput to prevent GPU starvation. In various tests I have only ever managed transfer speeds of between 200 and 250 MB/s, both during training and when benchmarking with hdparm.
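For context, the input pipeline is set up roughly like the sketch below (the file pattern, feature spec and batch size are placeholders rather than my exact values): shards are read in parallel with interleave and batches are prefetched, which is why I believe the pipeline itself is not the limiting factor.

```python
import tensorflow as tf

# Placeholder feature spec -- the real one matches my training data.
def parse_example(record):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(record, features)

# Placeholder shard pattern and batch size, not my exact values.
files = tf.data.Dataset.list_files("/data/tfrecords/train-*.tfrecord", shuffle=True)

dataset = files.interleave(
    lambda path: tf.data.TFRecordDataset(path, buffer_size=8 * 1024 * 1024),
    cycle_length=16,                        # read several shards in parallel
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,
)

dataset = (
    dataset
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)             # keep batches ready while the GPUs compute
)
```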

I have tried gp3 and io2 volumes and maxed out the IOPS and provisioned throughput, but this does not appear to raise the maximum achievable transfer speed.
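In case it helps anyone reproduce the measurement, a minimal sequential-read check along the lines below (the shard directory is a placeholder) is how I would confirm the raw volume throughput independently of TensorFlow; it needs to be run on files that are not already in the page cache, otherwise it measures RAM rather than the disk.

```python
import os
import time

# Placeholder directory -- point it at a handful of TFRecord shards.
SHARD_DIR = "/data/tfrecords"
CHUNK = 8 * 1024 * 1024  # 8 MiB reads, roughly what the input pipeline does

def sequential_read_mb_per_s(paths):
    total_bytes = 0
    start = time.time()
    for path in paths:
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                total_bytes += len(chunk)
    return total_bytes / (time.time() - start) / 1e6

paths = sorted(os.path.join(SHARD_DIR, n) for n in os.listdir(SHARD_DIR))[:8]
print(f"sequential read: {sequential_read_mb_per_s(paths):.0f} MB/s")
```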

Is it possible to achieve these faster I/O speeds with this type of instance? Can anyone demonstrate how it can be done? I have looked at alternative instance types, but the price jumps steeply and availability is poor (e.g. for the p4 instances). The data on the attached storage is in an XFS-formatted volume.

Any help and advice appreciated.

Paul
