Discrepancy between pcluster partition conf and EC2 instance specification

Submitting a Slurm job with the following parameters

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-t4

fails with

sbatch: error: CPU count per node can not be satisfied

The partition consists of g4dn.2xlarge instances, which have 8 vCPUs according to https://aws.amazon.com/ec2/instance-types/g4. To my surprise, /opt/slurm/etc/pcluster/slurm_parallelcluster_gpu-t4_partition.conf contains

NodeName=gpu-t4-dy-g4dn-2xlarge-[1-2] CPUs=4 RealMemory=31129 State=CLOUD Feature=dynamic,g4dn.2xlarge,g4dn-2xlarge,gpu Gres=gpu:t4:1

Shouldn't CPUs= be larger than 4 for this instance type?
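
The rejection is consistent with simple arithmetic: 1 task per node times 6 CPUs per task is 6, which exceeds the CPUs=4 that the partition file advertises. A minimal Python sketch of that comparison follows. It assumes that sbatch effectively checks tasks-per-node times cpus-per-task against the node's CPUs= value (the real scheduler logic considers more factors), and the helper names parse_node_line and fits_on_node are hypothetical.

```python
def parse_node_line(line: str) -> dict:
    """Split a slurm.conf-style NodeName line into a key/value dict."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def fits_on_node(node_line: str, ntasks_per_node: int, cpus_per_task: int) -> bool:
    """Simplified check: does the per-node CPU request fit into CPUs=?"""
    cpus = int(parse_node_line(node_line)["CPUs"])
    return ntasks_per_node * cpus_per_task <= cpus


# Node definition copied from the partition file quoted in the question.
NODE_LINE = (
    "NodeName=gpu-t4-dy-g4dn-2xlarge-[1-2] CPUs=4 RealMemory=31129 "
    "State=CLOUD Feature=dynamic,g4dn.2xlarge,g4dn-2xlarge,gpu Gres=gpu:t4:1"
)

print(fits_on_node(NODE_LINE, ntasks_per_node=1, cpus_per_task=6))  # False
print(fits_on_node(NODE_LINE, ntasks_per_node=1, cpus_per_task=4))  # True
```

Under this simplified model, the job would only be accepted with --cpus-per-task=4 or less, or after the node definition is changed to advertise more CPUs.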

asked a year ago, 298 views
2 Answers

I searched some more and found that, according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html#cpu-options-accelerated, g4dn.2xlarge apparently does not support more than 4 "valid CPU cores".

answered a year ago

If DisableSimultaneousMultithreading is not specified in the cluster configuration file, CPUs is set to 8:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=8 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

If DisableSimultaneousMultithreading is set to true in the cluster configuration file, CPUs is set to 4:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=4 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

See https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading for more information.
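
The two CPUs= values follow from the relation vCPUs = physical cores × threads per core. A minimal Python sketch of that arithmetic follows. It assumes that g4dn.2xlarge exposes 4 physical cores with 2 threads each (consistent with the 8 vCPUs and 4 cores mentioned in this thread), and the function name slurm_cpus is hypothetical, not part of ParallelCluster.

```python
def slurm_cpus(cores: int, threads_per_core: int, disable_smt: bool) -> int:
    """Return the CPUs= value expected in the generated node definition.

    With simultaneous multithreading disabled, only one thread per
    physical core is counted; otherwise every hardware thread is a CPU.
    """
    return cores if disable_smt else cores * threads_per_core


# Assumed topology for g4dn.2xlarge: 4 cores, 2 threads per core.
print(slurm_cpus(cores=4, threads_per_core=2, disable_smt=False))  # 8
print(slurm_cpus(cores=4, threads_per_core=2, disable_smt=True))   # 4
```

So a job requesting --cpus-per-task=6 can only fit on this instance type when DisableSimultaneousMultithreading is left unset or set to false.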

answered a year ago