FSxLustre FileSystemInput in SageMaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully with an EFS FileSystemInput or a TrainingInput. Using a FileSystemInput with an FSxLustre configuration instead causes the training job to die during the "Preparing the instances for training" stage:

InternalServerError: We encountered an internal error. Please try again.

This error is persistent upon re-submission.

What we have figured out so far:

  • the job errors before the training image is downloaded.
  • specifying an invalid mount point leads to a proper error: ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.
  • the job finishes successfully when running locally with docker-compose (Estimator with instance_type="local").
  • we can mount the FSx file system on an EC2 instance with the TrainingJob's VPC and security group.
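Since the manual mount from EC2 works, the security group is probably not the culprit, but it is cheap to double-check that the group attached to the training job still allows the TCP ports FSx for Lustre uses (988 and 1018-1023). The sketch below is only an illustration: the helper names are made up, and it only looks at port ranges, not at the rule sources (CIDR or self-referencing group), so a clean result does not prove the rules are correct.

```python
# Ports FSx for Lustre uses for client traffic (TCP 988 and 1018-1023).
LUSTRE_PORT_RANGES = [(988, 988), (1018, 1023)]


def permission_covers(permission, low, high):
    """Return True if one IpPermissions entry allows TCP ports low..high."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    return permission.get("FromPort", 0) <= low and permission.get("ToPort", -1) >= high


def missing_lustre_ports(ip_permissions):
    """List the Lustre port ranges that no single rule in ip_permissions covers."""
    return [
        (low, high)
        for low, high in LUSTRE_PORT_RANGES
        if not any(permission_covers(p, low, high) for p in ip_permissions)
    ]


def check_security_group(group_id):
    """Fetch a security group with boto3 and report uncovered Lustre ports."""
    import boto3  # imported lazily so the helpers above work without AWS access

    ec2 = boto3.client("ec2")
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    return {
        "inbound": missing_lustre_ports(group["IpPermissions"]),
        "outbound": missing_lustre_ports(group["IpPermissionsEgress"]),
    }
```

Calling `check_security_group` with the group ID passed to the Estimator returns the port ranges that no single inbound or outbound rule covers. A range that is split across several rules would be reported as missing, so treat any hit as a prompt to look at the rules by hand.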

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?
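For reference, this is roughly how the failure details SageMaker records for the job could be pulled out of a `DescribeTrainingJob` response. It is a minimal sketch, not something we claim surfaces the root cause: the function names are hypothetical, and for an InternalServerError the status messages may contain nothing beyond the generic text above.

```python
def summarize_training_job(description):
    """Pull the failure-related fields out of a DescribeTrainingJob response."""
    # Each transition records a secondary status and an optional message.
    transitions = [
        (t.get("Status"), t.get("StatusMessage", ""))
        for t in description.get("SecondaryStatusTransitions", [])
    ]
    # Keep only the channels that are backed by a file system (EFS or FSxLustre).
    channels = [
        c["DataSource"]["FileSystemDataSource"]
        for c in description.get("InputDataConfig", [])
        if "FileSystemDataSource" in c.get("DataSource", {})
    ]
    return {
        "status": description.get("TrainingJobStatus"),
        "secondary_status": description.get("SecondaryStatus"),
        "failure_reason": description.get("FailureReason"),
        "transitions": transitions,
        "vpc_config": description.get("VpcConfig"),
        "file_system_channels": channels,
    }


def describe_job(job_name):
    """Call the SageMaker API for job_name and summarize the response."""
    import boto3  # imported lazily so summarize_training_job works offline

    client = boto3.client("sagemaker")
    return summarize_training_job(
        client.describe_training_job(TrainingJobName=job_name)
    )
```

The summary shows the subnets, security groups, and file system settings the job was actually submitted with, which can be compared against the EC2 instance where the manual mount succeeded.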

1 Answer

InternalServerError means that there was an unforeseen error during training job orchestration that wasn't mapped to a known user-facing error.
You should create an AWS Support case to uncover the root cause.

AWS
answered 2 years ago
