Installation of GPU code on ParallelCluster


I am trying to use miniconda to install OpenMM, an MD engine that utilizes GPUs. My ParallelCluster configuration is as follows:

  • Head node: c5.2xlarge
  • Compute nodes: g4dn.metal (T4 GPUs)
  • Networked EFS storage from EC2

I can install OpenMM with miniconda on the head node, but when I run a test to see if OpenMM works, I get the following error:

python -m openmm.testInstallation

OpenMM Version: 8.0
Git Revision: a7800059645f4471f4b91c21e742fe5aa4513cda

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Error computing forces with CUDA platform

CUDA platform error: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1675115856424/work/platforms/cuda/src/CudaContext.cpp:140

Median difference in forces between platforms:

Reference vs. CPU: 6.2955e-06

All differences are within tolerance.

My head node lacks a GPU, which would explain the CUDA_ERROR_NO_DEVICE error. My next thought was to log in interactively to one of my GPU nodes, using either

salloc --time=30 --account=centos --nodes=1
salloc: Granted job allocation 5

or

srun --pty --mem=1g -n 1 --gres=gpu:1 -J modbind -p modbind /bin/bash

The first try with salloc, even though it shows that I'm in interactive mode, doesn't actually log me into the GPU node (the output of lspci -v is exactly the same before and after). The second try with srun simply hangs.
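
Comparing lspci output by eye is one way to tell whether a shell is really on a GPU node. A minimal Python sketch of a more direct check follows. It assumes the NVIDIA driver tools are installed on the GPU nodes so that nvidia-smi is on the PATH; the function name visible_nvidia_gpus is hypothetical and not part of OpenMM or ParallelCluster.

```python
import shutil
import subprocess


def visible_nvidia_gpus():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if none are visible.

    Assumption: the NVIDIA driver tools are installed on GPU nodes.
    An empty list means this shell cannot see a usable NVIDIA GPU.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tools on PATH, so treat this host as having no GPU.
        return []
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return [ln for ln in result.stdout.splitlines() if ln.startswith("GPU ")]


if __name__ == "__main__":
    gpus = visible_nvidia_gpus()
    if gpus:
        print("GPUs visible from this shell:")
        for line in gpus:
            print("  " + line)
    else:
        print("No NVIDIA GPU visible; this is probably not a GPU compute node.")
```

Running this before and after salloc or srun shows whether the session actually moved to a GPU node. If it reports no GPU, python -m openmm.testInstallation would be expected to fail on the CUDA platform in the same way.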

I've also looked at the GROMACS on AWS workshop, which uses Spack (https://catalog.workshops.aws/gromacs-on-aws-parallelcluster/en-US), since OpenMM is included in the list of software for Spack, but unfortunately the workshop only compiled GROMACS for CPU architecture, not CPU+GPU. Any help would be much appreciated.

blakem
asked a year ago, 523 views
2 Answers
Accepted Answer

Hi @blakem,

I can confirm the first issue is due to the lack of a GPU on the head node. To experiment on one of the compute nodes, you can submit a job, look up the node hostname with squeue, and then, once the job is in the running (R) state, connect to the node with SSH:

[ec2-user@ip-10-0-0-33 ~]$ sbatch --wrap "sleep 100"
Submitted batch job 1

[ec2-user@ip-10-0-0-33 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1    queue1     wrap ec2-user R       0:03      1 queue1-dy-queue1-t2medium-1

[ec2-user@ip-10-0-0-33 ~]$ ssh queue1-dy-queue1-t2medium-1
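
The hostname lookup in the squeue step can also be scripted. The following is a minimal Python sketch, assuming the default squeue column layout shown above and a job name without spaces; the function name node_for_job is hypothetical. It only parses text, so the squeue output has to be captured separately and passed in.

```python
def node_for_job(squeue_output, job_id):
    """Return the node name for job_id if the job is running (ST == "R"), else None.

    Assumption: squeue_output uses the default columns
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    and no field contains embedded spaces.
    """
    lines = [ln for ln in squeue_output.splitlines() if ln.strip()]
    if not lines:
        return None
    header = lines[0].split()
    try:
        id_col = header.index("JOBID")
        st_col = header.index("ST")
        node_col = header.index("NODELIST(REASON)")
    except ValueError:
        # Unexpected header, so refuse to guess at the columns.
        return None
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= node_col:
            continue
        # A pending job shows a reason such as "(Resources)" instead of a node name,
        # so only a job in state R yields a hostname.
        if fields[id_col] == str(job_id) and fields[st_col] == "R":
            return fields[node_col]
    return None


if __name__ == "__main__":
    sample = (
        "             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n"
        "                 1    queue1     wrap ec2-user R       0:03      1 queue1-dy-queue1-t2medium-1\n"
    )
    print(node_for_job(sample, 1))
```

A None result means the job is still pending or configuring, which is normal while a dynamic node boots; polling squeue again later and re-running the lookup is one way to wait for it.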

Once on the compute node, you can try to install the package manually. If it works as expected, you can automate the installation with an OnNodeConfigured custom bootstrap action: https://docs.aws.amazon.com/parallelcluster/latest/ug/custom-bootstrap-actions-v3.html

Enrico

AWS
answered a year ago

@enrico-aws, thanks for the quick turnaround and suggestion. I had to wait 10+ minutes for my GPU node to finish initializing, but once it was running I was able to log in to the GPU node.

blakem
answered a year ago
