FSx for Lustre with Amazon SageMaker error - Artifact upload failed:Please ensure that the subnet's route table has a route to an S3 VPC endpoint or a NAT device

Hi,

I am trying to use FSx for Lustre for my SageMaker training. I followed this tutorial: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/.

Code:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

sess = sagemaker.Session()

# Training job attached to the custom VPC through its subnet and security groups.
estimator = Estimator(image_uri='image-uri',
                      role='my-role',
                      base_job_name='training-job',
                      instance_count=1,
                      sagemaker_session=sess,
                      instance_type='ml.m5.xlarge',  # ml.p3.2xlarge
                      subnets=['subnet-id'],
                      security_group_ids=['sg-id', 'sg-id'],
                      hyperparameters={...}
                     )

# FSx for Lustre file system used as the training input channel.
train_input = FileSystemInput(file_system_id='fs-id',
                              file_system_type='FSxLustre',
                              directory_path='/mount-idea/dataset',
                              file_system_access_mode='rw')

estimator.fit(train_input)

The first issue is that the container has no internet connection, and the second is this error:

UnexpectedStatusException: Error for Training job training-job: Failed. Reason: ClientError: Artifact upload failed:Please ensure that the subnet's route table has a route to an S3 VPC endpoint or a NAT device, and both the security groups and the subnet's network ACL allow uploading data to all output URIs

I have created a custom VPC for this with private and public subnets. I enabled a NAT device and an S3 endpoint. I am using one of the public subnets so that I can have internet access. For the security groups I use, I created custom inbound and outbound rules (Custom TCP) for port 988 and ports 1018 - 1023.
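As a quick sanity check of the rules described above, something like the following could confirm that a security group covers the two Lustre port ranges. This is only a sketch: the helper names `covers`, `missing_lustre_ranges`, and `fetch_ingress_rules` are made up for illustration, it only recognizes a range when a single rule covers it entirely, and it does not check the rule's source or destination.

```python
# Port ranges taken from the question: TCP 988 and TCP 1018-1023.
LUSTRE_PORT_RANGES = [(988, 988), (1018, 1023)]


def covers(permission, low, high):
    """True if one IpPermissions entry allows TCP (or everything) on low..high."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols on all ports
        return True
    if protocol != "tcp":
        return False
    return permission.get("FromPort", 0) <= low and permission.get("ToPort", -1) >= high


def missing_lustre_ranges(permissions):
    """Return the Lustre port ranges that no single rule in `permissions` covers."""
    return [
        (low, high)
        for low, high in LUSTRE_PORT_RANGES
        if not any(covers(p, low, high) for p in permissions)
    ]


def fetch_ingress_rules(group_id):
    """Hypothetical helper: read the inbound rules of one security group via boto3."""
    import boto3  # imported lazily so the pure checks above run without AWS access

    ec2 = boto3.client("ec2")
    response = ec2.describe_security_groups(GroupIds=[group_id])
    return response["SecurityGroups"][0]["IpPermissions"]


# Rules shaped like the ones described in the question.
sample_rules = [
    {"IpProtocol": "tcp", "FromPort": 988, "ToPort": 988},
    {"IpProtocol": "tcp", "FromPort": 1018, "ToPort": 1023},
]
print(missing_lustre_ranges(sample_rules))  # → []
```

An empty list means both ranges are covered. The same check can be run on the outbound side by passing the group's `IpPermissionsEgress` list instead.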

The code is running in a SageMaker notebook that is attached to the custom VPC I created and uses the same subnet and security groups as the ones passed to the estimator.

What should I do to fix this error?

Thank you!

1 Answer

What are your security group's outbound rules?
If all outbound traffic is allowed, the security group is not the problem.

You may also want to check the network ACLs.

Also check that the subnet's route table contains a route to the S3 VPC endpoint.
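The route table check can be scripted. The sketch below is only an illustration: the function names and the example IDs are hypothetical, and it assumes the route dictionaries have the shape returned by the EC2 `describe_route_tables` call. It also does not verify that the prefix list actually belongs to S3 rather than another gateway endpoint service.

```python
def s3_upload_paths(routes):
    """Return which S3-capable paths appear in a list of EC2 route dicts."""
    found = set()
    for route in routes:
        if route.get("State", "active") != "active":
            continue  # ignore blackholed routes
        gateway = route.get("GatewayId", "")
        # Gateway endpoint routes pair a prefix list (pl-...) with a vpce-... target.
        if route.get("DestinationPrefixListId") and gateway.startswith("vpce-"):
            found.add("gateway-endpoint")
        # A default route through a NAT gateway also reaches S3.
        if route.get("DestinationCidrBlock") == "0.0.0.0/0" and route.get("NatGatewayId"):
            found.add("nat")
    return found


def fetch_routes_for_subnet(subnet_id):
    """Hypothetical helper: routes of the table explicitly associated with a subnet."""
    import boto3  # imported lazily so the pure check above runs without AWS access

    ec2 = boto3.client("ec2")
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    # An empty result usually means the subnet falls back to the VPC's main route table.
    return [route for table in response["RouteTables"] for route in table["Routes"]]


# Placeholder IDs, not taken from any real account.
example_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-example", "State": "active"},
    {"DestinationPrefixListId": "pl-example", "GatewayId": "vpce-example", "State": "active"},
]
print(sorted(s3_upload_paths(example_routes)))  # → ['gateway-endpoint']
```

An empty result would suggest that the subnet has neither of the two paths named in the error message.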

Expert
answered 1 year ago
  • Thank you for responding!

    Security group inbound and outbound rules of my-sg-id:

    IP version  Type         Protocol  Port range   Source
    -           Custom TCP   TCP       988          my-sg-id
    -           Custom TCP   TCP       1018 - 1023  my-sg-id
    IPv4        All traffic  All       All          0.0.0.0/0

    Network ACL inbound and outbound rules:

    Rule number  Type         Protocol  Port range  Source     Allow/Deny
    100          All traffic  All       All         0.0.0.0/0  Allow
    *            All traffic  All       All         0.0.0.0/0  Deny

    This is the route table of the subnet I use (subnet-id):

    Destination  Target
    10.0.0.0/16  local
    0.0.0.0/0    igw-0a053be26059959b2
    pl-63a5400a  vpce-06fd1bab012e09723

    vpce-06fd1bab012e09723 is the S3 endpoint:

    Service name = com.amazonaws.us-east-1.s3
    Endpoint type = Gateway
    Private DNS names enabled = No
    Policy = { "Version": "2008-10-17", "Statement": [ { "Effect": "Allow", "Principal": "*", "Action": "*", "Resource": "*" } ] }

    The subnet I use is a public subnet. Is that okay?

    I have made some progress with those configurations and no longer get the "S3 VPC endpoint or a NAT device" error. However, I still don't have internet access in the container (I tested it with a simple request to google.com in train.py).

  • If the container is running in a properly configured public subnet, it should be able to reach the outside world. If you are running in a public subnet and the security groups and network ACLs are configured correctly, you may want to check with AWS Support.
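
One thing that may be worth ruling out before contacting support is where the subnet's default route actually points. The sketch below classifies a route table by its 0.0.0.0/0 target, using the routes posted in the comment above. The function name is hypothetical, and the reasoning rests on an assumption that is not confirmed in this thread: if the training job's network interface has no public IP, a default route to an internet gateway alone would not give it internet access, and a private subnet routed through a NAT gateway would be needed instead.

```python
def default_route_target(routes):
    """Classify where the 0.0.0.0/0 route of a route table points."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "internet-gateway"
        return "other"
    return "none"  # no default route at all


# Route table from the comment above, written as EC2-style route dicts.
thread_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0a053be26059959b2"},
    {"DestinationPrefixListId": "pl-63a5400a", "GatewayId": "vpce-06fd1bab012e09723"},
]
print(default_route_target(thread_routes))  # → internet-gateway
```

If the result is "internet-gateway" and the container still cannot reach the internet, trying one of the private subnets whose default route goes through the NAT device would be a reasonable next experiment.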
