FSxLustre FileSystemInput in SageMaker TrainingJob leads to InternalServerError

We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully with an EFS FileSystemInput or a TrainingInput. Using a FileSystemInput with the FSxLustre configuration makes the training job die during the "Preparing the instances for training" stage:

InternalServerError: We encountered an internal error. Please try again.

This error is persistent upon re-submission.
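
For reference, an FSx for Lustre input of the kind described above ends up as one InputDataConfig entry in the CreateTrainingJob request. The following is a minimal sketch of that entry using plain dictionaries; the helper name `fsx_lustre_channel`, the file system ID, and the directory path are hypothetical and chosen only for illustration. The SDK's `sagemaker.inputs.FileSystemInput` takes the same four values (file system ID, type, directory path, access mode).

```python
from typing import Any, Dict

VALID_ACCESS_MODES = {"ro", "rw"}


def fsx_lustre_channel(
    channel_name: str,
    file_system_id: str,
    directory_path: str,
    access_mode: str = "ro",
) -> Dict[str, Any]:
    """Build one InputDataConfig entry backed by an FSx for Lustre file system."""
    if access_mode not in VALID_ACCESS_MODES:
        raise ValueError(f"access_mode must be one of {sorted(VALID_ACCESS_MODES)}")
    if not directory_path.startswith("/"):
        raise ValueError("directory_path must be an absolute path")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": access_mode,
                "DirectoryPath": directory_path,
            }
        },
    }


# Hypothetical identifiers, for illustration only.
channel = fsx_lustre_channel("training", "fs-0123456789abcdef0", "/fsx/train")
print(channel)
```

Writing the channel out this way makes it easy to compare, field by field, against the EFS configuration that works.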

What we have figured out so far:

  • the job errors before the training image is downloaded.
  • specifying an invalid mount point leads to a proper error: ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.
  • the job finishes successfully when running locally with docker-compose (Estimator with instance_type="local").
  • we can mount the FSx file system on an EC2 instance with the TrainingJob's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?

1 Answer
InternalServerError means that an unforeseen error occurred during training job orchestration, one that was not mapped to a known user-facing error.
You should create an AWS Support case to uncover the root cause.
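
When opening the case, it may help to attach the job's recorded status details. The sketch below assumes you already have the dictionary returned by the DescribeTrainingJob API and condenses its status fields into a few report lines; the helper name `summarize_training_job` and the sample data are hypothetical, and the sample is hand-written rather than real service output.

```python
from typing import Any, Dict, List


def summarize_training_job(description: Dict[str, Any]) -> List[str]:
    """Collect status fields of a DescribeTrainingJob response into report lines."""
    lines = [
        f"Job: {description.get('TrainingJobName', '<unknown>')}",
        f"Status: {description.get('TrainingJobStatus', '<unknown>')}"
        f" / {description.get('SecondaryStatus', '<unknown>')}",
    ]
    if "FailureReason" in description:
        lines.append(f"FailureReason: {description['FailureReason']}")
    # Each transition records one stage the job passed through.
    for step in description.get("SecondaryStatusTransitions", []):
        message = step.get("StatusMessage", "")
        lines.append(f"  {step.get('Status', '?')}: {message}".rstrip())
    return lines


# Hand-written sample shaped like a DescribeTrainingJob response (not real output).
sample = {
    "TrainingJobName": "example-job",
    "TrainingJobStatus": "Failed",
    "SecondaryStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
    "SecondaryStatusTransitions": [
        {"Status": "Starting", "StatusMessage": "Preparing the instances for training"},
        {"Status": "Failed"},
    ],
}
report = summarize_training_job(sample)
print("\n".join(report))
```

With boto3 installed and credentials configured, the input dictionary would come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`.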

AWS
answered 2 years ago
