What should I do after my training job fails with "InternalServerError"?


I am running an object detection training job with ResNet-50 on a single ml.p2.xlarge instance (50 GB) in Pipe mode, with 1073 training images and 268 validation images. According to CloudWatch, it ran up to epoch 77 (about 4 hours) and then failed with no specific message recorded in CloudWatch. All I get is the dreaded "InternalServerError: We encountered an internal error. Please try again.", which is not acceptable: the job costs money, and I need to know what is failing.

CPU utilization is stable at around 270%. That number needs to be divided by the number of vCPUs, which is 4, so it is really about 68% per vCPU. GPU utilization is constant at under 60%, GPU memory utilization is constant at around 18%, memory utilization is constant at 3.2%, and disk utilization is stable at 0.22%.
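
To make the CPU normalization explicit, here is a minimal sketch of the arithmetic. The function name `per_vcpu_utilization` is hypothetical and only for illustration; the inputs are the 270% aggregate figure and the 4 vCPUs mentioned above.

```python
def per_vcpu_utilization(total_percent: float, vcpus: int) -> float:
    """Convert an aggregate CPU utilization percentage into a per-vCPU percentage."""
    if vcpus <= 0:
        raise ValueError("vcpus must be a positive integer")
    return total_percent / vcpus


# Aggregate utilization of about 270% across 4 vCPUs
print(per_vcpu_utilization(270, 4))  # 67.5
```

So the instance is at roughly two thirds of its CPU capacity, which does not look like a bottleneck by itself.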

Is there an obvious mistake I am making? Thanks for the help!

fascani
asked a year ago, 83 views
No Answers
