Discrepancy between pcluster partition conf and EC2 instance specification


Submitting a Slurm job with the following parameters

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-t4

fails with

sbatch: error: CPU count per node can not be satisfied

The partition consists of g4dn.2xlarge instances, which have 8 vCPUs according to https://aws.amazon.com/ec2/instance-types/g4. To my surprise, /opt/slurm/etc/pcluster/slurm_parallelcluster_gpu-t4_partition.conf contains

NodeName=gpu-t4-dy-g4dn-2xlarge-[1-2] CPUs=4 RealMemory=31129 State=CLOUD Feature=dynamic,g4dn.2xlarge,g4dn-2xlarge,gpu Gres=gpu:t4:1

Shouldn't CPUs= be larger than 4 for this instance type?
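To make the mismatch concrete, here is a minimal Python sketch that reads the CPUs= value from the NodeName line quoted above and compares it with the job request. The helper names parse_node_line and fits_on_node are hypothetical, and the check (ntasks-per-node times cpus-per-task must not exceed CPUs) is a simplified assumption about what sbatch validates, not Slurm's actual logic.

```python
def parse_node_line(line: str) -> dict:
    """Split a slurm.conf-style NodeName line into a key -> value dict."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


# Node definition copied from slurm_parallelcluster_gpu-t4_partition.conf
NODE_LINE = (
    "NodeName=gpu-t4-dy-g4dn-2xlarge-[1-2] CPUs=4 RealMemory=31129 "
    "State=CLOUD Feature=dynamic,g4dn.2xlarge,g4dn-2xlarge,gpu Gres=gpu:t4:1"
)


def fits_on_node(node_line: str, ntasks_per_node: int, cpus_per_task: int) -> bool:
    """Simplified check (an assumption): requested CPUs per node <= CPUs= in the conf."""
    cpus = int(parse_node_line(node_line)["CPUs"])
    return ntasks_per_node * cpus_per_task <= cpus


print(fits_on_node(NODE_LINE, ntasks_per_node=1, cpus_per_task=6))  # False
print(fits_on_node(NODE_LINE, ntasks_per_node=1, cpus_per_task=4))  # True
```

With CPUs=4, the request for 6 CPUs per task cannot fit on one node, which matches the sbatch error.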

asked a year ago, 297 views
2 Answers

I searched some more and found that, according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html#cpu-options-accelerated, g4dn.2xlarge may not support more than 4 "Valid CPU cores".

answered a year ago

If DisableSimultaneousMultithreading is not specified in the cluster configuration file, the node is configured with CPUs=8:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=8 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

If DisableSimultaneousMultithreading is set to true in the cluster configuration file, the node is configured with CPUs=4:

NodeName=queue1-st-g4dn2xlarge-[1-1] CPUs=4 RealMemory=31129 State=CLOUD Feature=static,g4dn.2xlarge,g4dn2xlarge,gpu Gres=gpu:t4:1

See https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading for more information.
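The two cases can be summarized as simple arithmetic. The sketch below is illustrative only: the function name expected_slurm_cpus is hypothetical, and it assumes 2 hardware threads per core, which is consistent with the 8 vCPUs and CPUs=4 figures in this thread but is not read from any AWS API.

```python
def expected_slurm_cpus(vcpus: int, disable_smt: bool = False,
                        threads_per_core: int = 2) -> int:
    """Estimate the CPUs= value written to the partition conf.

    Assumption: with multithreading disabled, only one thread per
    physical core is counted, so the vCPU count is divided by
    threads_per_core.
    """
    if disable_smt:
        return vcpus // threads_per_core
    return vcpus


# g4dn.2xlarge has 8 vCPUs
print(expected_slurm_cpus(8))                    # 8
print(expected_slurm_cpus(8, disable_smt=True))  # 4
```

So the CPUs=4 seen in the question would be consistent with DisableSimultaneousMultithreading being set to true for that compute resource, in which case the job can request at most 4 CPUs per node.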

answered a year ago
