How to overcome an IO bottleneck in TensorFlow training on p3 instances

I am training a model on a p3.8xlarge instance, and I am finding that the GPUs are getting starved: training slows down, with GPU utilisation at zero for much of the time. It appears to be an IO bottleneck. I have tried a number of things and have been unable to resolve it.

The dataset is large (3 TB), so it must be streamed from disk. I am using tf.data with TFRecords and I believe this part of the pipeline is set up correctly. I have worked out that, for this particular task on this instance, I need at least 750 MB/s of read throughput to prevent GPU starvation. In various tests I have never managed to get transfer speeds above roughly 200 to 250 MB/s, both during training and when measuring with hdparm.
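For reference, the input pipeline follows the standard tf.data TFRecord pattern, roughly like the sketch below (the feature spec, file paths, batch size and parallelism values are illustrative, not my exact configuration):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Illustrative feature spec -- the real one depends on the dataset.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(example["image"], tf.uint8)
    return image, example["label"]

# Placeholder path to the TFRecord shards on the attached volume.
files = tf.data.Dataset.list_files("/data/tfrecords/*.tfrecord", shuffle=True)

dataset = (
    files.interleave(
        tf.data.TFRecordDataset,          # stream each shard from disk
        cycle_length=16,                  # read several shards in parallel
        num_parallel_calls=AUTOTUNE,
    )
    .map(parse_example, num_parallel_calls=AUTOTUNE)
    .batch(256)
    .prefetch(AUTOTUNE)                   # overlap IO with GPU compute
)
```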

I have tried gp3 and io2 volumes with the provisioned IOPS and throughput maxed out, but this does not appear to raise the maximum achievable read speed.
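In case it helps anyone reproduce the measurement outside of TensorFlow and hdparm, this is roughly how I check raw sequential read speed from the volume (the file path and block size below are just placeholders; the file should be larger than RAM, or the page cache dropped first, so caching does not inflate the number):

```python
import os
import time

# Placeholder path to a large file on the attached volume.
path = "/data/tfrecords/shard-00000.tfrecord"
block = 8 * 1024 * 1024  # read in 8 MiB chunks

fd = os.open(path, os.O_RDONLY)
total = 0
start = time.time()
while True:
    chunk = os.read(fd, block)
    if not chunk:
        break
    total += len(chunk)
os.close(fd)

elapsed = time.time() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s "
      f"-> {total / 1e6 / elapsed:.0f} MB/s")
```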

Is it possible to achieve these faster IO speeds with this type of instance? Can anyone demonstrate how it can be done? I have looked at alternative instance types, but the price jumps steeply and availability is poor (e.g. for the p4 instances). The data is on an XFS-formatted attached volume.

Any help and advice appreciated.

Paul