FSxLustre FileSystemInput in SageMaker TrainingJob leads to: InternalServerError


We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully with an EFS FileSystemInput or with a TrainingInput. Using a FileSystemInput with an FSxLustre configuration instead causes the training job to die during the "Preparing the instances for training" stage:

InternalServerError: We encountered an internal error. Please try again.

This error is persistent upon re-submission.
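For reference, the input described above can be sketched at the API level. This is a minimal illustration, not our exact configuration: the helper name `file_system_channel`, the file system ID, and the directory path are hypothetical placeholders. The dictionary mirrors the `FileSystemDataSource` structure of a training job input channel, which carries the same four values the SDK's `FileSystemInput` takes (`file_system_id`, `file_system_type`, `directory_path`, `file_system_access_mode`).

```python
from typing import Any, Dict

# Values accepted for a file system data source in a training job channel.
VALID_TYPES = {"EFS", "FSxLustre"}
VALID_MODES = {"ro", "rw"}


def file_system_channel(
    channel_name: str,
    file_system_id: str,
    directory_path: str,
    file_system_type: str = "FSxLustre",
    access_mode: str = "ro",
) -> Dict[str, Any]:
    """Build one input channel entry (hypothetical helper for illustration)."""
    if file_system_type not in VALID_TYPES:
        raise ValueError(f"unsupported file system type: {file_system_type}")
    if access_mode not in VALID_MODES:
        raise ValueError(f"unsupported access mode: {access_mode}")
    if not directory_path.startswith("/"):
        raise ValueError("directory_path must be an absolute path")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": access_mode,
            }
        },
    }


# Placeholder ID and path; substitute the real file system and directory.
channel = file_system_channel("training", "fs-0123456789abcdef0", "/fsx/train")
```

Switching `file_system_type` between "EFS" and "FSxLustre" (with the matching ID and path) is the only difference between the working and the failing submission.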

What we have figured out so far:

  • the job errors before the training image is downloaded.
  • specifying an invalid mount point leads to a proper error: ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.
  • the job finishes successfully when running locally with docker-compose (Estimator with instance_type="local").
  • we can mount the FSx file system on an EC2 instance with the TrainingJob's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?

1 Answer

InternalServerError means that an unforeseen error occurred during training job orchestration and was not mapped to a known user-facing error.
You should create an AWS Support case to uncover the root cause.
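Before opening the case, it may help to collect what the service already reports about the failed job. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; `summarize_training_job` and `fetch_description` are hypothetical helper names, and the sample response is made up for illustration. It reads only fields that `describe_training_job` returns (`TrainingJobArn`, `SecondaryStatus`, `FailureReason`, `SecondaryStatusTransitions`).

```python
from typing import Any, Dict, List, Optional


def summarize_training_job(description: Dict[str, Any]) -> List[str]:
    """Condense a describe_training_job response into lines for a support case."""
    lines = [
        f"Job: {description.get('TrainingJobName', '?')}",
        f"ARN: {description.get('TrainingJobArn', '?')}",
        f"Status: {description.get('TrainingJobStatus', '?')}"
        f" / {description.get('SecondaryStatus', '?')}",
        f"FailureReason: {description.get('FailureReason', 'n/a')}",
    ]
    # Each transition records a phase the job went through before it failed.
    for step in description.get("SecondaryStatusTransitions", []):
        lines.append(f"  {step.get('Status', '?')}: {step.get('StatusMessage', '')}")
    return lines


def fetch_description(job_name: str, region_name: Optional[str] = None) -> Dict[str, Any]:
    """Fetch the job description from SageMaker (requires boto3 and credentials)."""
    import boto3  # imported lazily so the summary helper works without it

    client = boto3.client("sagemaker", region_name=region_name)
    return client.describe_training_job(TrainingJobName=job_name)


# Made-up response used only to illustrate the output shape.
sample = {
    "TrainingJobName": "my-fsx-job",
    "TrainingJobArn": "arn:aws:sagemaker:eu-west-1:111122223333:training-job/my-fsx-job",
    "TrainingJobStatus": "Failed",
    "SecondaryStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
    "SecondaryStatusTransitions": [
        {"Status": "Starting", "StatusMessage": "Preparing the instances for training"},
        {"Status": "Failed", "StatusMessage": "Training job failed"},
    ],
}
summary = summarize_training_job(sample)
```

For a real job, `summarize_training_job(fetch_description("<job name>"))` produces the same kind of summary. Including the job ARN and the status transitions in the case gives Support the identifiers needed to look up the orchestration logs.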

AWS
answered 2 years ago
