What to do after my training job fails with "InternalServerError"?


I am training a ResNet-50 object detection model on a single ml.p2.xlarge instance with a 50 GB volume, in Pipe mode, with 1073 training images and 268 validation images. According to CloudWatch, the job ran up to epoch 77 (about 4 hours) and then failed without any specific message in the logs. All I get is the dreaded "InternalServerError: We encountered an internal error. Please try again.", which is not acceptable: the job costs money, and I need to know what is failing.
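For anyone trying to reproduce this, here is a minimal sketch of how the failure details could be pulled from the API rather than from the console. It assumes boto3 is installed and credentials are configured; the job name in the comment is hypothetical, and `summarize_failure` is an illustrative helper, not part of any SDK. It only reads the `TrainingJobStatus`, `FailureReason`, and `SecondaryStatusTransitions` fields of the `describe_training_job` response, so it may well show nothing beyond the same generic message.

```python
def summarize_failure(job: dict) -> str:
    """Format the status fields of a describe_training_job response."""
    lines = [
        f"Status: {job.get('TrainingJobStatus', 'unknown')}",
        f"FailureReason: {job.get('FailureReason', '(none reported)')}",
    ]
    # Each transition records a phase (Starting, Downloading, Training, ...)
    # and an optional human-readable message.
    for step in job.get("SecondaryStatusTransitions", []):
        lines.append(f"  {step.get('Status')}: {step.get('StatusMessage', '')}")
    return "\n".join(lines)


def fetch_and_summarize(job_name: str) -> str:
    """Call the SageMaker API for one job (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    client = boto3.client("sagemaker")
    job = client.describe_training_job(TrainingJobName=job_name)
    return summarize_failure(job)


# Hypothetical usage; replace the name with the real training job name:
# print(fetch_and_summarize("my-object-detection-job"))
```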

CPU utilization is stable at around 270%. That figure is summed across the instance's 4 vCPUs, so it is really about 68% per vCPU. GPU utilization is constant at under 60%, GPU memory utilization at around 18%, memory utilization at 3.2%, and disk utilization at 0.22%.
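To make the per-vCPU figure above explicit, here is the arithmetic as a small sketch. The function name is made up for illustration; the 270% reading and the 4 vCPUs are the values from this question.

```python
def per_vcpu_utilization(total_percent: float, vcpus: int) -> float:
    """Convert a CPU percentage summed over all vCPUs into a per-vCPU average."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return total_percent / vcpus


# 270% aggregate reading on a 4-vCPU instance
print(per_vcpu_utilization(270, 4))  # 67.5
```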

Am I making an obvious mistake? Thanks for the help!

fascani
asked a year ago, 90 views
No answers
