FSxLustre FileSystemInput in SageMaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully with an EFS FileSystemInput or a TrainingInput. Using a FileSystemInput with an FSxLustre configuration causes the training job to die during the "Preparing the instances for training" stage:

InternalServerError: We encountered an internal error. Please try again.

This error is persistent upon re-submission.
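
For reference, a minimal sketch of the kind of channel definition involved, written as the `InputDataConfig` entry that the CreateTrainingJob API receives. The helper name `fsx_lustre_channel`, the file system ID, and the directory path are hypothetical placeholders, not our real values.

```python
# Sketch only: builds one InputDataConfig entry for an FSx for Lustre channel.
# The ID and path used in the example call are hypothetical placeholders.
import json


def fsx_lustre_channel(channel_name, file_system_id, directory_path, access_mode="ro"):
    """Return a CreateTrainingJob InputDataConfig entry for FSx for Lustre."""
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": access_mode,
                "DirectoryPath": directory_path,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    channel = fsx_lustre_channel("training", "fs-0123456789abcdef0", "/fsx/train")
    print(json.dumps(channel, indent=2))
```

With the SageMaker Python SDK, the same channel is expressed as `sagemaker.inputs.FileSystemInput(file_system_id=..., file_system_type="FSxLustre", directory_path=..., file_system_access_mode="ro")`, and the Estimator additionally needs `subnets` and `security_group_ids` so the training instances can reach the file system.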

What we have figured out so far:

  • The job fails before the training image is downloaded.
  • Specifying an invalid mount point leads to a proper error: ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.
  • The job finishes successfully when run locally with docker-compose (Estimator with instance_type="local").
  • We can mount the FSx file system on an EC2 instance that uses the TrainingJob's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?

1 Answer
InternalServerError means that an unforeseen error occurred during training job orchestration and was not mapped to a known user-facing error.
You should create an AWS Support case to uncover the root cause.
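
Before opening the case, it may help to collect what the DescribeTrainingJob API reports for the failed job, since the status transitions show how far orchestration got. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names `fetch_description` and `summarize_training_job` are made up for illustration.

```python
# Sketch only: summarize the fields of a DescribeTrainingJob response that
# are most useful to attach to a support case.


def fetch_description(job_name):
    """Call DescribeTrainingJob; boto3 is imported lazily so the
    summarizer below can be used without it."""
    import boto3

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


def summarize_training_job(description):
    """Return readable lines with status, failure reason, and transitions."""
    lines = [
        "Job: {}".format(description.get("TrainingJobName", "<unknown>")),
        "Status: {} / {}".format(
            description.get("TrainingJobStatus", "<unknown>"),
            description.get("SecondaryStatus", "<unknown>"),
        ),
        "FailureReason: {}".format(description.get("FailureReason", "<none>")),
    ]
    # Each transition records a secondary status and an optional message.
    for step in description.get("SecondaryStatusTransitions", []):
        lines.append(
            "  {}: {}".format(step.get("Status"), step.get("StatusMessage", ""))
        )
    return lines
```

Calling `print("\n".join(summarize_training_job(fetch_description("<your-job-name>"))))` prints the summary. Including the training job name or ARN and the region in the support case lets AWS look up the internal logs for that job.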

AWS
Answered 2 years ago