
Questions tagged with Amazon FSx for Lustre



FSxLustre FileSystemInput in Sagemaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job with the SageMaker SDK and a custom Docker image. The job finishes successfully with an EFS [FileSystemInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput) or a [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput). Using `FileSystemInput` with an FSx for Lustre configuration causes the training job to die during the `Preparing the instances for training` stage:

```
InternalServerError: We encountered an internal error. Please try again.
```

This error persists on re-submission. What we have figured out so far:

- The job errors before the training image is downloaded.
- Specifying an invalid mount point leads to a proper error: `ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.`
- The job finishes successfully when run locally with docker-compose ([Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase) with `instance_type="local"`).
- We can mount the FSx file system on an EC2 instance using the training job's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?
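Not stated in the question, but one place to look for more detail is the `DescribeTrainingJob` API, which returns a `FailureReason` field and per-stage `SecondaryStatusTransitions`. A minimal sketch (the `summarize_training_failure` helper is illustrative; in practice the client would be `boto3.client("sagemaker")` and `job_name` your failed job's name):

```python
def summarize_training_failure(sm_client, job_name):
    """Collect FailureReason and per-stage status messages for a training job.

    `sm_client` is a SageMaker client (e.g. boto3.client("sagemaker"));
    injected here so the function can also be exercised with a stub.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    lines = ["FailureReason: " + desc.get("FailureReason", "<none>")]
    # SecondaryStatusTransitions shows which stage the job was in when it died.
    for t in desc.get("SecondaryStatusTransitions", []):
        lines.append(f"{t['Status']}: {t.get('StatusMessage', '')}")
    return "\n".join(lines)
```

The status messages often narrow down whether the failure happened while provisioning instances, mounting the file system, or downloading the image.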
1 answer · 0 votes · 56 views · asked 3 months ago

Creating custom YAML files for AWS Parallel Cluster

I am trying to follow the tutorial for running FDS/SMV on AWS ParallelCluster here: https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/ . I get to the point where it asks me to set up a config file with the following data:

```
[aws]
aws_region_name = <AWS-REGION>

[global]
sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
compute_instance_type = c5n.18xlarge
master_instance_type = c5.xlarge
initial_queue_size = 0
max_queue_size = 100
scheduler = slurm
cluster_type = ondemand
s3_read_write_resource = arn:aws:s3:::fds-smv-bucket-unique*
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
```

I am unable to create a YAML file that ParallelCluster will accept for `create-cluster`. It returns the error:

```
{
  "message": "Bad Request: Configuration must be a valid YAML document"
}
```

I attempted to create a YAML file using the AWS ParallelCluster configure wizard (https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-configuring.html), but it does not have all the specifications the tutorial asks for, like a shared S3 bucket in FSx. I tried starting with the wizard-created configuration file and editing it so it looks like the YAML files in the documentation, but it still returns the same error.

See my edited YAML file here:

```
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-032f3e6409362aff2
  Ssh:
    KeyName: MyKeyPair1
  DisableSimultaneousMultithreading: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: c5n18xlarge
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 100
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-032f3e6409362aff2
      Iam:
        S3Access:
          - BucketName: fds-smv-jts-bucket-1
            EnableWriteAccess: True
        AdditionalIamPolicies:
          - Policy: arn:aws:s3:::fds-smv-jts-bucket-1*
SharedStorage:
  - MountDir: /fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportedFileChunkSize: 1024
      ExportPath: s3://fds-smv-jts-bucket-1
      ImportPath: s3://fds-smv-jts-bucket-1
Tags:
  - Key: String
    Value: fds-smv
DevSettings
  ClusterTemplate: fds-smv-cluster
```

Any ideas on how to create the proper YAML file with all the data that is requested for the tutorial? Thank you!
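An observation, inferred from the pasted file rather than stated in the question: `DevSettings` appears without a trailing colon, and a bare scalar followed by an indented mapping is invalid YAML, which is exactly the kind of thing that produces `Configuration must be a valid YAML document` regardless of whether the keys match the schema. A sketch of how that tail section would need to be written (key names copied from the question's file; this only fixes the syntax, not any schema issues):

```
Tags:
  - Key: String
    Value: fds-smv
DevSettings:          # the colon is required to start a nested mapping
  ClusterTemplate: fds-smv-cluster
```

Tabs instead of spaces, or inconsistent indentation introduced while editing, would trigger the same "valid YAML document" error.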
1 answer · 0 votes · 248 views · asked 8 months ago

What value should I set for directory_path for the Amazon SageMaker SDK with FSx as data source?

What value should I set for the **directory_path** parameter in **FileSystemInput** for the Amazon SageMaker SDK? Here is some information about my Amazon FSx for Lustre file system:

- My FSx ID is `fs-0684xxxxxxxxxxx`.
- My FSx has the mount name `lhskdbmv`.
- The FSx maps to an Amazon S3 bucket with files (without extra prefixes in their keys).

My attempts to describe the job and the results are the following:

**Attempt 1:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='lhskdbmv',
    file_system_access_mode='ro')
```

**Result:** `estimator.fit(fs)` returns `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: FileSystem DirectoryPath 'lhskdbmv' for channel 'training' is not absolute or normalized. Please ensure you don't have a trailing "/", and/or "..", ".", "//" in the path.`

**Attempt 2:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='/',
    file_system_access_mode='ro')
```

**Result:** `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: The directory path for FSx Lustre file system fs-068406952bf758bac is invalid. The directory path must begin with mount name of the file system.`

**Attempt 3:**

```
fs = FileSystemInput(
    file_system_id='fs-0684xxxxxxxxxxx',
    file_system_type='FSxLustre',
    directory_path='fsx',
    file_system_access_mode='ro')
```

**Result:** `ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: FileSystem DirectoryPath 'fsx' for channel 'training' is not absolute or normalized. Please ensure you don't have a trailing "/", and/or "..", ".", "//" in the path.`
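Taken together, the two validation errors spell out the rule: the path must be absolute ("not absolute or normalized") and must begin with the file system's mount name ("must begin with mount name"), i.e. `/lhskdbmv` for this file system. A small illustrative helper (the function name and optional `subdir` argument are my own, not part of the SDK):

```python
def fsx_directory_path(mount_name, subdir=""):
    """Build a directory_path satisfying both validation errors above:
    absolute, beginning with the FSx mount name, no trailing slash."""
    path = "/" + mount_name
    if subdir:
        # strip leading/trailing slashes so the result stays normalized
        path += "/" + subdir.strip("/")
    return path

fsx_directory_path("lhskdbmv")            # → "/lhskdbmv"
fsx_directory_path("lhskdbmv", "train/")  # → "/lhskdbmv/train"
```

The mount name (`lhskdbmv` here) is the value shown in the FSx console or returned by `DescribeFileSystems`, not the local mount directory such as `/fsx`.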
1 answer · 1 vote · 79 views · EXPERT · asked 2 years ago