ModuleNotFoundError in SageMaker training job


I am running a SageMaker training job whose entry point is a Python script (train.py) that uses PyTorch. The script imports several other custom Python modules (e.g. my_dataloader.py, my_model.py). All of these files are packaged together into a tar file in an S3 bucket. When I run the job, train.py gets copied into a new directory in my S3 bucket inside an archive named sourcedir.tar.gz, but none of the other files it imports end up there, so the job fails with:

AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'my_dataloader'"
Command "/opt/conda/bin/python3.8 train.py --batch_size 16 --epochs 10", exit code: 1
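For reference, I launch the job roughly like this with the SageMaker Python SDK's PyTorch estimator (a simplified sketch; the bucket, IAM role, framework version, and instance type below are placeholders rather than my exact values):

```python
from sagemaker.pytorch import PyTorch

# Simplified sketch of the launch code -- the role ARN, versions, and instance
# type are placeholders, not my exact configuration.
estimator = PyTorch(
    entry_point="train.py",  # the script that fails to import my_dataloader
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    framework_version="1.12",
    py_version="py38",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters={"batch_size": 16, "epochs": 10},
)

# my_dataloader.py and my_model.py sit alongside train.py in my tar file in S3,
# but nothing here references them explicitly.
estimator.fit({"training": "s3://my-bucket/training-data/"})  # placeholder data URI
```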

How can I make sure the SageMaker job gets all the files from my original tar file so that train.py runs correctly?
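This is roughly how I confirmed that only train.py ended up in the generated archive (a sketch; the bucket and object key are placeholders for wherever the job wrote sourcedir.tar.gz):

```python
import tarfile

import boto3

# Download the sourcedir.tar.gz that SageMaker generated and list its contents.
s3 = boto3.client("s3")
s3.download_file(
    "my-bucket",                                   # placeholder bucket
    "my-training-job/source/sourcedir.tar.gz",     # placeholder key
    "sourcedir.tar.gz",
)

with tarfile.open("sourcedir.tar.gz", "r:gz") as archive:
    # Prints only ['train.py']; my_dataloader.py and my_model.py are missing.
    print(archive.getnames())
```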

Asked 6 months ago · 47 views
No Answers
