FSxLustre FileSystemInput in SageMaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully with an EFS FileSystemInput or a TrainingInput. With a FileSystemInput configured for FSxLustre, however, the training job dies during the "Preparing the instances for training" stage with:

InternalServerError: We encountered an internal error. Please try again.

The error persists on every re-submission.
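
For reference, a FileSystemInput of this kind ends up as one InputDataConfig entry in the underlying CreateTrainingJob request. The sketch below builds such an entry as a plain dict so its shape is easy to inspect. The helper name, the file system ID, and the directory path are placeholders chosen for illustration, not the values from the failing job.

```python
def fsx_lustre_channel(file_system_id, directory_path,
                       channel_name="training", access_mode="ro"):
    """Build one InputDataConfig entry backed by an FSx for Lustre file system.

    Hypothetical helper: it only assembles the request structure and does
    not call any AWS API.
    """
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    if not directory_path.startswith("/"):
        raise ValueError("directory_path must be an absolute path")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": access_mode,
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder values, not taken from the job described in this question.
channel = fsx_lustre_channel("fs-0123456789abcdef0", "/fsx/train")
```

A job that uses a file system data source also has to run inside a VPC, so the request carries a VpcConfig with subnets and security group IDs next to this entry.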

What we have figured out so far:

  • The job fails before the training image is downloaded.
  • Specifying an invalid mount point leads to a proper error: ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.
  • The job finishes successfully when run locally with docker-compose (Estimator with instance_type="local").
  • We can mount the FSx file system on an EC2 instance that uses the TrainingJob's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?

Chris
Asked 2 years ago · 418 views
1 Answer

InternalServerError means that an unforeseen error occurred during training job orchestration and was not mapped to a known user-facing error.
To uncover the root cause, you should create an AWS Support case.
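
When opening the case, it may help to attach what the DescribeTrainingJob API reports for the failed job. The following is a minimal sketch, assuming the response has already been fetched (for example with boto3's describe_training_job). The function name and the sample values are hypothetical. The field names follow the DescribeTrainingJob response.

```python
def summarize_training_job(description):
    """Pick out the DescribeTrainingJob fields worth quoting in a support case."""
    file_systems = []
    for channel in description.get("InputDataConfig", []):
        fs = channel.get("DataSource", {}).get("FileSystemDataSource")
        if fs is not None:
            file_systems.append((
                channel.get("ChannelName"),
                fs.get("FileSystemType"),
                fs.get("FileSystemId"),
                fs.get("DirectoryPath"),
            ))
    # Each transition records the stage the job was in and its status message.
    transitions = [
        (t.get("Status"), t.get("StatusMessage", ""))
        for t in description.get("SecondaryStatusTransitions", [])
    ]
    return {
        "arn": description.get("TrainingJobArn"),
        "status": description.get("TrainingJobStatus"),
        "secondary_status": description.get("SecondaryStatus"),
        "failure_reason": description.get("FailureReason"),
        "vpc_config": description.get("VpcConfig"),
        "file_systems": file_systems,
        "transitions": transitions,
    }


# Hypothetical response for illustration only. A real one would come from:
#   import boto3
#   description = boto3.client("sagemaker").describe_training_job(
#       TrainingJobName="<your-job-name>")
sample = {
    "TrainingJobArn": "arn:aws:sagemaker:us-east-1:111122223333:training-job/example",
    "TrainingJobStatus": "Failed",
    "SecondaryStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
    "VpcConfig": {"Subnets": ["subnet-0example"], "SecurityGroupIds": ["sg-0example"]},
    "InputDataConfig": [{
        "ChannelName": "training",
        "DataSource": {"FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",
            "FileSystemType": "FSxLustre",
            "FileSystemAccessMode": "ro",
            "DirectoryPath": "/fsx/train",
        }},
    }],
    "SecondaryStatusTransitions": [
        {"Status": "Starting", "StatusMessage": "Preparing the instances for training"},
        {"Status": "Failed", "StatusMessage": "Training job failed"},
    ],
}
summary = summarize_training_job(sample)
```

The job ARN, the failure reason, the VPC configuration, and the last status transition are the details a support engineer is likely to ask for first.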

AWS
Answered 2 years ago
