Python ImportError when running job on AWS ParallelCluster


Is there a straightforward way to make sure compute nodes have access to Python packages installed on the head node of a ParallelCluster?

I have a Python script that trains an ML model with PyTorch. I was able to get the script running on the head node. When I try to run the script on a compute node using an sbatch file, the job fails with "ImportError: No module named torch".

The Python packages I installed to run the script are in the "/home/ec2-user/.local/lib/python3.8/site-packages/" folder.

asked 6 months ago · 185 views
1 Answer

Hi @gbradford,

ParallelCluster provides SharedStorage options, where the storage is accessible from both the head node and the compute nodes.

You can use external (existing) storage, or let ParallelCluster create the storage for you (managed storage).

From ParallelCluster 3.8.0 onwards, /home can also be used as a mount point.

Please refer to the docs for more details: https://docs.aws.amazon.com/parallelcluster/latest/ug/shared-storage-quotas-integration-v3.html
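As an illustration (not from the original answer), here is a minimal sketch of what a SharedStorage section in a ParallelCluster 3.x cluster config could look like, assuming a managed EFS volume; the Name and MountDir values are placeholders:

```yaml
# Hypothetical SharedStorage section of a ParallelCluster 3.x config (cluster-config.yaml).
# Packages installed under the shared mount (e.g. into a virtualenv created there)
# are then visible from both the head node and the compute nodes.
SharedStorage:
  - MountDir: /shared          # placeholder mount point
    Name: shared-efs           # placeholder name
    StorageType: Efs
    EfsSettings:
      DeletionPolicy: Retain   # keep the filesystem if the cluster is deleted
```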

Thanks

answered 6 months ago
  • Thanks for your answer. I found that the issue was really that the Python version on the compute node didn't match the Python version on the head node (I had updated Python to 3.8 on the head node). Adding "sudo amazon-linux-extras enable python3.8" and "sudo yum install -y python38" to the beginning of my sbatch script solved the issue. See the sketch below.
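For reference, a minimal sketch of what the commenter's fix might look like at the top of the sbatch script; the job name, partition, and training script path are placeholders, not from the original post:

```bash
#!/bin/bash
#SBATCH --job-name=train-model      # placeholder job name
#SBATCH --partition=queue1          # placeholder partition/queue

# Install Python 3.8 on the compute node so it matches the head node
# (the Amazon Linux 2 default Python is older).
sudo amazon-linux-extras enable python3.8
sudo yum install -y python38

# Run the training script with the matching interpreter so that packages
# installed under ~/.local/lib/python3.8/site-packages are found.
python3.8 train.py                  # placeholder script name
```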
