Discrepancy between pcluster partition conf and EC2 instance specification


Submitting a Slurm job with the following parameters

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-t4

fails with

sbatch: error: CPU count per node can not be satisfied

The partition comprises g4dn.2xlarge instances, which have 8 vCPUs according to https://aws.amazon.com/ec2/instance-types/g4. To my surprise, /opt/slurm/etc/pcluster/slurm_parallelcluster_gpu-t4_partition.conf includes

NodeName=gpu-t4-dy-g4dn-2xlarge-[1-2] CPUs=4 RealMemory=31129 State=CLOUD Feature=dynamic,g4dn.2xlarge,g4dn-2xlarge,gpu Gres=gpu:t4:1

Shouldn't CPUs= be larger than 4 for this instance type?

asked a year ago, 291 views
2 Answers

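I searched some more and found that, according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html#cpu-options-accelerated, g4dn.2xlarge supports at most 4 valid CPU cores. The 8 vCPUs listed for the instance type are hardware threads (4 cores with 2 threads each), so CPUs=4 appears to reflect the physical core count rather than the vCPU count.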

answered a year ago

If DisableSimultaneousMultithreading is not specified in the cluster configuration file, CPUs is 8:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=8 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

If DisableSimultaneousMultithreading is set to true in the cluster configuration file, CPUs is 4:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=4 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

See https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading for more information.
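
For illustration, here is a rough sketch of where this setting sits in a ParallelCluster 3 cluster configuration file. The queue name (gpu-t4) and compute resource name (g4dn-2xlarge) are guessed from the node name in the question, and the MinCount/MaxCount values are assumptions. Other required keys, such as Networking, are left out, so treat this as a fragment rather than a complete configuration.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-t4                  # assumed from the node name prefix
      ComputeResources:
        - Name: g4dn-2xlarge        # assumed from the node name
          InstanceType: g4dn.2xlarge
          MinCount: 0               # assumption: dynamic nodes only
          MaxCount: 2               # assumption: matches the [1-2] node range
          # false (or omitted): Slurm sees CPUs=8, one per hardware thread
          # true: Slurm sees CPUs=4, one per physical core
          DisableSimultaneousMultithreading: false
```

With the setting left at true, the job in the question would presumably need --cpus-per-task=4 or less to fit on one node.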

answered a year ago
