How to overcome an I/O bottleneck in p3 instance TensorFlow training?


I am training a model on a p3.8xlarge instance and I am finding that the GPUs are being starved: training slows down and GPU utilisation sits at zero for much of the time. It appears to be an I/O bottleneck. I have tried a number of things and have been unable to resolve it.

The dataset is large (3 TB), so it must be streamed from disk. I am using a tf.data pipeline with TFRecords and I believe this aspect is set up correctly. I have worked out that, for this particular task on this instance, I need at least 750 MB/s of I/O throughput to prevent GPU starvation. In various tests I have only ever managed transfer speeds of between 200 and 250 MB/s, both during training and when benchmarking with hdparm.
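For context, the input pipeline is set up roughly like the sketch below (the file pattern, feature spec and batch size are placeholders rather than my exact values): shards are read in parallel with interleave and batches are prefetched, which is why I believe the pipeline itself is not the limiting factor.

```python
import tensorflow as tf

# Placeholder feature spec -- the real one matches my training data.
def parse_example(record):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(record, features)

# Placeholder shard pattern and batch size, not my exact values.
files = tf.data.Dataset.list_files("/data/tfrecords/train-*.tfrecord", shuffle=True)

dataset = files.interleave(
    lambda path: tf.data.TFRecordDataset(path, buffer_size=8 * 1024 * 1024),
    cycle_length=16,                        # read several shards in parallel
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,
)

dataset = (
    dataset
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)             # keep batches ready while the GPUs compute
)
```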

I have tried gp3 and io2 volumes and maxed out the IOPS and provisioned throughput, but this does not appear to raise the maximum achievable transfer speed.
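In case it helps anyone reproduce the measurement, a minimal sequential-read check along the lines below (the shard directory is a placeholder) is how I would confirm the raw volume throughput independently of TensorFlow; it needs to be run on files that are not already in the page cache, otherwise it measures RAM rather than the disk.

```python
import os
import time

# Placeholder directory -- point it at a handful of TFRecord shards.
SHARD_DIR = "/data/tfrecords"
CHUNK = 8 * 1024 * 1024  # 8 MiB reads, roughly what the input pipeline does

def sequential_read_mb_per_s(paths):
    total_bytes = 0
    start = time.time()
    for path in paths:
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                total_bytes += len(chunk)
    return total_bytes / (time.time() - start) / 1e6

paths = sorted(os.path.join(SHARD_DIR, n) for n in os.listdir(SHARD_DIR))[:8]
print(f"sequential read: {sequential_read_mb_per_s(paths):.0f} MB/s")
```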

Is it possible to achieve these faster I/O speeds with this type of instance? Can anyone demonstrate how it can be done? I have looked at alternative instance types, but the price jumps steeply and availability is poor (e.g. for the p4 instances). The data on the attached storage is in an XFS-formatted volume.

Any help and advice appreciated.

Paul
