FSx for Lustre with AWS SageMaker - Error: Artifact upload failed: Please ensure that the subnet's route table has a route to an S3 VPC endpoint or a NAT device



I am trying to use FSx for Lustre for my SageMaker training job. I followed this tutorial: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/.


import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

sess = sagemaker.Session()
role = get_execution_role()

estimator = Estimator(image_uri='image-uri',
                      role=role,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',  # or ml.p3.2xlarge
                      subnets=['subnet-id'],
                      security_group_ids=['sg-id', 'sg-id'],
                      sagemaker_session=sess)

train_input = FileSystemInput(file_system_id='fs-id',
                              file_system_type='FSxLustre',
                              directory_path='/mount-idea/dataset',
                              file_system_access_mode='rw')

estimator.fit(inputs=train_input)


The first issue is that I don't have an internet connection inside the container, and the second is this error:

UnexpectedStatusException: Error for Training job training-job: Failed. Reason: ClientError: Artifact upload failed:Please ensure that the subnet's route table has a route to an S3 VPC endpoint or a NAT device, and both the security groups and the subnet's network ACL allow uploading data to all output URIs

I have created a custom VPC for this with private and public subnets, and I enabled a NAT gateway and an S3 endpoint. I am using one of the public subnets so that I can have access to the internet. For the security groups I use, I created custom inbound/outbound rules with Custom TCP for port 988 and ports 1018 - 1023.
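The Lustre port requirements above can be sanity-checked offline. The following is a minimal sketch, not an AWS API call: it takes rules shaped like the `IpPermissions` entries returned by EC2 `DescribeSecurityGroups` and verifies that ports 988 and 1018 - 1023 are covered; the sample rules below are hypothetical, mirroring the ones described in the question.

```python
# Sketch: verify that security-group rules cover the ports FSx for Lustre
# needs (988 and 1018-1023). Rules use the shape of EC2
# DescribeSecurityGroups "IpPermissions" entries; the data is hypothetical.

LUSTRE_PORTS = [988] + list(range(1018, 1024))

def covers_lustre_ports(ip_permissions):
    """Return True if every Lustre port falls inside some allowed TCP rule."""
    def allowed(port):
        for rule in ip_permissions:
            if rule.get("IpProtocol") not in ("tcp", "-1"):
                continue
            if rule["IpProtocol"] == "-1":
                # Protocol -1 means "all traffic" and carries no port range.
                return True
            if rule["FromPort"] <= port <= rule["ToPort"]:
                return True
        return False
    return all(allowed(p) for p in LUSTRE_PORTS)

# Hypothetical rules mirroring the question: TCP 988 and TCP 1018-1023.
rules = [
    {"IpProtocol": "tcp", "FromPort": 988, "ToPort": 988},
    {"IpProtocol": "tcp", "FromPort": 1018, "ToPort": 1023},
]
print(covers_lustre_ports(rules))  # -> True
```

In practice you would feed this the real rules fetched with boto3 (e.g. `boto3.client("ec2").describe_security_groups(...)`) for every security group passed to the estimator.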

The code runs in a SageMaker notebook instance attached to the custom VPC I created, using the same subnet and security groups as the ones passed to the estimator.

What should I do to fix this error?

Thank you!

1 Answer

What are your security group outbound rules?
If all outbound traffic is permitted, they should not be the problem.

You may also want to check network ACLs.

Also check if the route to the S3 VPC endpoint is set in the route table.
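That route-table check can be scripted. This is a sketch under the assumption that routes are shaped like EC2 `DescribeRouteTables` output, where a gateway S3 endpoint appears as a prefix-list destination (`pl-...`) targeting a `vpce-...` gateway; the sample routes reuse the IDs from the question.

```python
# Sketch: check whether a route table contains a route to an S3 gateway
# endpoint. Routes use the shape of EC2 DescribeRouteTables output; an S3
# gateway endpoint shows up as a pl-... destination with a vpce-... target.

def has_s3_endpoint_route(routes):
    return any(
        r.get("DestinationPrefixListId", "").startswith("pl-")
        and r.get("GatewayId", "").startswith("vpce-")
        for r in routes
    )

# Sample routes reusing the IDs from the question.
routes = [
    {"GatewayId": "local"},
    {"GatewayId": "igw-0a053be26059959b2"},
    {"DestinationPrefixListId": "pl-63a5400a",
     "GatewayId": "vpce-06fd1bab012e09723"},
]
print(has_s3_endpoint_route(routes))  # -> True
```

With boto3 you would fetch the routes for the training subnet's route table via `boto3.client("ec2").describe_route_tables(...)` and pass them in.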

answered a year ago
  • Thank you for responding!

    Security group inbound and outbound rules of my-sg-id:

        Type         Protocol  Port range   Source/Destination
        Custom TCP   TCP       988          my-sg-id
        Custom TCP   TCP       1018 - 1023  my-sg-id
        All traffic  All       All          IPv4

    Network ACL inbound and outbound rules:

        Rule number  Type         Protocol  Port range  Allow/Deny
        100          All traffic  All       All         Allow
        *            All traffic  All       All         Deny

    In the subnet I use (subnet-id), this is the route table:

        Destination  Target
                     local
                     igw-0a053be26059959b2
        pl-63a5400a  vpce-06fd1bab012e09723

    with vpce-06fd1bab012e09723 being the S3 endpoint:

        Service name: com.amazonaws.us-east-1.s3
        Endpoint type: Gateway
        Private DNS names enabled: No
        Policy:

        {
          "Version": "2008-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": "*",
              "Action": "*",
              "Resource": "*"
            }
          ]
        }

    The subnet I use is a public subnet, is that okay?

    I have made some progress with those configurations and no longer get the "S3 VPC endpoint or a NAT device" error. However, I still don't have internet access in the container (I tested it with a simple request to google.com in train.py).

  • If the container is running on an appropriate public subnet, it would be accessible to the outside world. If you are running on a public subnet and have security groups and network ACLs configured correctly, you may want to check with AWS support.
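The connectivity test mentioned in the comments can be made into a small helper inside the training script. A minimal sketch; the probe URL and timeout are arbitrary choices, not anything prescribed by SageMaker:

```python
# Sketch: probe for outbound internet access from inside a training
# container, e.g. logged at the top of train.py. URL and timeout are
# arbitrary choices.
import urllib.request
import urllib.error

def has_internet(url="https://www.google.com", timeout=5):
    """Return True if an HTTP request to `url` succeeds within `timeout`."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (urllib.error.URLError, OSError):
        return False

print("internet access:", has_internet())
```

Logging the result this way makes it easy to tell from the CloudWatch training logs whether the networking change actually took effect.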
