How to overcome an IO bottleneck in TensorFlow training on p3 instances

I am training a model on a p3.8xlarge instance, and I am finding that the GPUs are getting starved: training slows down, with GPU utilisation at zero for much of the time. It appears to be an IO bottleneck. I have tried a number of things and have been unable to resolve it.

The dataset is large (3 TB), so it must be streamed from disk. I am using tf.data with TFRecords and I believe this part of the pipeline is set up correctly. I have worked out that, for this particular task on this instance, I need at least 750 MB/s of read throughput to prevent GPU starvation. In various tests I have never managed to get transfer speeds above roughly 200 to 250 MB/s, both during training and when measuring with hdparm.
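For reference, the input pipeline follows the standard tf.data TFRecord pattern, roughly like the sketch below (the feature spec, file paths, batch size and parallelism values are illustrative, not my exact configuration):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Illustrative feature spec -- the real one depends on the dataset.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(example["image"], tf.uint8)
    return image, example["label"]

# Placeholder path to the TFRecord shards on the attached volume.
files = tf.data.Dataset.list_files("/data/tfrecords/*.tfrecord", shuffle=True)

dataset = (
    files.interleave(
        tf.data.TFRecordDataset,          # stream each shard from disk
        cycle_length=16,                  # read several shards in parallel
        num_parallel_calls=AUTOTUNE,
    )
    .map(parse_example, num_parallel_calls=AUTOTUNE)
    .batch(256)
    .prefetch(AUTOTUNE)                   # overlap IO with GPU compute
)
```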

I have tried gp3 and io2 volumes with the provisioned IOPS and throughput maxed out, but this does not appear to raise the maximum achievable read speed.
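In case it helps anyone reproduce the measurement outside of TensorFlow and hdparm, this is roughly how I check raw sequential read speed from the volume (the file path and block size below are just placeholders; the file should be larger than RAM, or the page cache dropped first, so caching does not inflate the number):

```python
import os
import time

# Placeholder path to a large file on the attached volume.
path = "/data/tfrecords/shard-00000.tfrecord"
block = 8 * 1024 * 1024  # read in 8 MiB chunks

fd = os.open(path, os.O_RDONLY)
total = 0
start = time.time()
while True:
    chunk = os.read(fd, block)
    if not chunk:
        break
    total += len(chunk)
os.close(fd)

elapsed = time.time() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s "
      f"-> {total / 1e6 / elapsed:.0f} MB/s")
```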

Is it possible to achieve these faster IO speeds with this type of instance? Can anyone demonstrate how it can be done? I have looked at alternative instance types, but the price jumps steeply and availability is poor (e.g. for the p4 instances). The data is on an XFS-formatted attached volume.

Any help and advice appreciated.

Paul