Drained state on Slurm in ParallelCluster


Hi, I am using ParallelCluster 3.5.1 with Slurm and have set up a cluster with 4 queues using c6i.large, c6i.xlarge, c6i.2xlarge and c6i.4xlarge instances in the Frankfurt region. The queues are otherwise identical, and all have hyper-threading disabled. I have noticed that the queue with 1024 c6i.4xlarge instances sometimes behaves strangely, even with only one job queued. Its nodes tend to fall into states I would not expect, namely:

  • down~ - I would expect idle~
  • down# - I would expect allocated~ or allocated#

In addition, it takes more than 20 minutes to spin up the instance, whereas for the other types it is typically under 10 minutes. In the meantime, the job sits in the PD state with a rather cryptic reason: (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). The output from sinfo and squeue is as follows:
$ sinfo
q8           up   infinite      1  down# q8-dy-c6i-4xlarge-8cpu-32gb-1
q8           up   infinite   1023  down~ q8-dy-c6i-4xlarge-8cpu-32gb-[2-1024]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               176        q8   310601 flacsclo PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
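For reference, the Reason string that Slurm attaches to a node can be inspected directly on the head node; a generic sketch, using one of the node names from the output above:

$ scontrol show node q8-dy-c6i-4xlarge-8cpu-32gb-1   # full node record, including State and Reason
$ sinfo -R                                           # only the nodes that currently have a Reason set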

Can you please help me understand the states the job and nodes fall into? I have not seen these issues on the development cluster, but I am seeing them on staging, and I am concerned about the cost implications this may have if I deploy the cluster to the production environment. Best, Michal

asked a year ago · 768 views
5 Answers
Accepted Answer

Hi @mfolusiak,

From the logs, you are hitting insufficient capacity: there are not enough available c6i.4xlarge instances in the Availability Zone you selected to launch the compute nodes. Please refer to the insufficient-capacity section of the ParallelCluster troubleshooting guide for hints on how to avoid these issues.
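As a concrete (untested) illustration, one common mitigation is to give the compute resource more than one pool to draw from. Recent ParallelCluster 3.x releases let you list several instance types under a single compute resource, so the fleet manager can fall back to an alternative type when c6i.4xlarge capacity in that Availability Zone is exhausted. A minimal sketch of what the q8 queue definition might look like (the subnet ID and the fallback instance type are placeholders, not taken from your cluster):

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q8
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx   # placeholder subnet in eu-central-1
      ComputeResources:
        - Name: c6i-4xlarge-8cpu-32gb
          DisableSimultaneousMultithreading: true
          # Listing more than one instance type lets ParallelCluster try an
          # alternative pool when one type hits InsufficientInstanceCapacity.
          Instances:
            - InstanceType: c6i.4xlarge
            - InstanceType: c5.4xlarge   # placeholder fallback with the same core count
          MinCount: 0
          MaxCount: 1024

Spreading the queue across subnets in additional Availability Zones (the RunInstances error itself suggests eu-central-1b and eu-central-1c) has a similar effect, where your ParallelCluster version supports it.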

answered a year ago
EXPERT · reviewed 10 months ago

Can you provide the following logs from the head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/slurm_resume.log, and /var/log/slurmctld.log?

For more information, please see our troubleshooting guide: https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting.html
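If it is easier, the pcluster CLI can also bundle those logs for you; a sketch, assuming the cluster name from your logs (prod-slurm) and an S3 bucket you own to stage the export:

$ pcluster export-cluster-logs --cluster-name prod-slurm \
      --bucket my-log-bucket --bucket-prefix prod-slurm-logs
  # my-log-bucket is a placeholder; the bucket is used to stage the
  # CloudWatch Logs export, and the command produces a log archive.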

answered a year ago

Log file: /var/log/slurmctld.log:

...
[2023-04-28T13:11:53.848] POWER: Power save mode: 15360 nodes
[2023-04-28T13:22:23.912] POWER: Power save mode: 15360 nodes
[2023-04-28T13:23:52.132] _slurm_rpc_submit_batch_job: JobId=174 InitPrio=4294901586 usec=3574
[2023-04-28T13:23:52.254] _slurm_rpc_submit_batch_job: JobId=175 InitPrio=4294901585 usec=1820
[2023-04-28T13:23:52.366] _slurm_rpc_submit_batch_job: JobId=176 InitPrio=4294901584 usec=1339
[2023-04-28T13:23:52.438] sched/backfill: _start_job: Started JobId=174 in q2 on q2-dy-c6i-xlarge-2cpu-8gb-1
[2023-04-28T13:23:52.444] sched/backfill: _start_job: Started JobId=175 in q4 on q4-dy-c6i-2xlarge-4cpu-16gb-1
[2023-04-28T13:23:52.448] sched/backfill: _start_job: Started JobId=176 in q8 on q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:23:53.923] POWER: power_save: pid 1303259 waking nodes q2-dy-c6i-xlarge-2cpu-8gb-1,q4-dy-c6i-2xlarge-4cpu-16gb-1,q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:24:00.130] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
[2023-04-28T13:24:00.130] requeue job JobId=176 due to failure of node q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:24:00.131] Requeuing JobId=176
[2023-04-28T13:24:00.131] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 state set to DOWN
[2023-04-28T13:24:23.925] POWER: JobId=174 needed resuming but nodes aren't power_save anymore
[2023-04-28T13:24:23.926] POWER: JobId=175 needed resuming but nodes aren't power_save anymore
[2023-04-28T13:24:23.926] POWER: JobId=176 needed resuming but isn't configuring anymore
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 state set to DOWN
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-3 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-3 state set to DOWN
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-4 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-4 state set to DOWN
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-5 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:24:29.417] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-5 state set to DOWN
...
[2023-04-28T13:24:29.592] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1023 state set to DOWN
[2023-04-28T13:24:29.592] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:24:29.592] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 state set to DOWN
[2023-04-28T13:26:35.369] Node q4-dy-c6i-2xlarge-4cpu-16gb-1 now responding
[2023-04-28T13:26:35.377] Node q2-dy-c6i-xlarge-2cpu-8gb-1 now responding
[2023-04-28T13:26:54.528] job_time_limit: Configuration for JobId=174 complete
[2023-04-28T13:26:54.528] Resetting JobId=174 start time for node power up
[2023-04-28T13:26:54.529] job_time_limit: Configuration for JobId=175 complete
[2023-04-28T13:26:54.529] Resetting JobId=175 start time for node power up
[2023-04-28T13:30:57.824] _job_complete: JobId=174 WEXITSTATUS 0
[2023-04-28T13:30:57.825] _job_complete: JobId=174 done
[2023-04-28T13:30:57.921] _job_complete: JobId=175 WEXITSTATUS 0
[2023-04-28T13:30:57.921] _job_complete: JobId=175 done
[2023-04-28T13:32:23.969] POWER: power_save: pid 1305066 suspending nodes q2-dy-c6i-xlarge-2cpu-8gb-1,q4-dy-c6i-2xlarge-4cpu-16gb-1
[2023-04-28T13:32:53.971] POWER: Power save mode: 15357 nodes
[2023-04-28T13:34:29.463] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:34:29.463] powering down node q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:34:29.463] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:34:29.463] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-2
[2023-04-28T13:34:29.463] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-3 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:34:29.463] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-3
...
[2023-04-28T13:34:29.642] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1022
[2023-04-28T13:34:29.642] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1023 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:34:29.642] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1023
[2023-04-28T13:34:29.642] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:34:29.642] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1024
[2023-04-28T13:34:53.989] POWER: power_save: pid 1305560 suspending nodes q8-dy-c6i-4xlarge-8cpu-32gb-[1-1024]
[2023-04-28T13:37:51.043] sched: Allocate JobId=176 NodeList=q8-dy-c6i-4xlarge-8cpu-32gb-1 #CPUs=8 Partition=q8
[2023-04-28T13:37:53.007] POWER: power_save: pid 1306076 waking nodes q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:37:54.835] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
[2023-04-28T13:37:54.835] requeue job JobId=176 due to failure of node q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:37:54.836] Requeuing JobId=176
[2023-04-28T13:37:54.836] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 state set to DOWN
[2023-04-28T13:38:23.010] POWER: JobId=176 needed resuming but isn't configuring anymore
[2023-04-28T13:38:29.532] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:38:29.532] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 state set to DOWN
...
[2023-04-28T13:38:29.708] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1023 state set to DOWN
[2023-04-28T13:38:29.708] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 reason set to: (Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity
[2023-04-28T13:38:29.708] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 state set to DOWN
[2023-04-28T13:43:23.036] POWER: Power save mode: 15359 nodes
[2023-04-28T13:48:29.864] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:48:29.864] powering down node q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:48:29.864] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-2 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:48:29.864] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-2
[2023-04-28T13:48:29.864] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-3 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:48:29.864] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-3
[2023-04-28T13:48:29.864] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-4 reason set to: Enabling node since insufficient capacity timeout expired
...
[2023-04-28T13:48:30.036] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1022
[2023-04-28T13:48:30.036] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1023 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:48:30.036] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1023
[2023-04-28T13:48:30.036] update_node: node q8-dy-c6i-4xlarge-8cpu-32gb-1024 reason set to: Enabling node since insufficient capacity timeout expired
[2023-04-28T13:48:30.036] power down request repeating for node q8-dy-c6i-4xlarge-8cpu-32gb-1024
[2023-04-28T13:48:53.068] POWER: power_save: pid 1308274 suspending nodes q8-dy-c6i-4xlarge-8cpu-32gb-[1-1024]
[2023-04-28T13:51:51.732] sched: Allocate JobId=176 NodeList=q8-dy-c6i-4xlarge-8cpu-32gb-1 #CPUs=8 Partition=q8
[2023-04-28T13:51:53.087] POWER: power_save: pid 1308803 waking nodes q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T13:52:23.089] POWER: JobId=176 needed resuming but nodes aren't power_save anymore
[2023-04-28T13:53:53.097] POWER: Power save mode: 15359 nodes
[2023-04-28T13:54:27.620] Node q8-dy-c6i-4xlarge-8cpu-32gb-1 rebooted 134 secs ago
[2023-04-28T13:54:27.620] Node q8-dy-c6i-4xlarge-8cpu-32gb-1 now responding
[2023-04-28T13:54:54.869] job_time_limit: Configuration for JobId=176 complete
[2023-04-28T13:54:54.869] Resetting JobId=176 start time for node power up
[2023-04-28T13:58:58.386] _job_complete: JobId=176 WEXITSTATUS 0
[2023-04-28T13:58:58.386] _job_complete: JobId=176 done
[2023-04-28T14:00:23.134] POWER: power_save: pid 1310484 suspending nodes q8-dy-c6i-4xlarge-8cpu-32gb-1
[2023-04-28T14:04:23.154] POWER: Power save mode: 15360 nodes
[2023-04-28T14:14:53.208] POWER: Power save mode: 15360 nodes
[2023-04-28T14:25:23.261] POWER: Power save mode: 15360 nodes
answered a year ago

Log file: /var/log/parallelcluster/clustermgtd:

2023-04-28 13:24:22,885 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:24:23,138 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:24:23,138 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:24:28,909 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:24:29,070 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:24:29,355 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:24:29,371 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:24:29,406 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:24:29,597 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:25:22,892 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:25:22,895 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:25:23,141 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:25:23,141 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:25:28,926 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:25:29,009 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:25:29,241 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:25:29,258 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:25:29,294 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:25:29,296 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:26:22,902 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:26:22,904 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:26:23,148 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:26:23,148 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:26:28,923 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:26:29,003 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:26:29,233 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:26:29,249 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:26:29,286 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:26:29,288 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:27:22,949 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:27:22,951 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:27:23,191 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:27:23,192 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:27:28,968 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:27:29,120 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:27:29,377 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:27:29,393 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:27:29,429 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:27:29,430 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:28:22,983 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:28:22,985 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:28:23,228 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:28:23,228 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:28:29,003 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:28:29,100 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:28:29,325 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:28:29,341 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:28:29,377 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:28:29,378 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:29:23,037 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:29:23,039 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:29:23,289 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:29:23,290 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:29:29,073 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:29:29,161 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:29:29,378 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:29:29,395 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:29:29,434 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:29:29,436 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:30:23,090 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:30:23,092 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:30:23,341 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:30:23,341 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:30:29,118 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:30:29,269 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:30:29,522 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:30:29,537 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:30:29,573 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:30:29,574 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:31:23,144 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:31:23,145 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:31:23,395 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:31:23,395 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-04-28 13:31:29,199 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-04-28 13:31:29,285 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-04-28 13:31:29,504 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-04-28 13:31:29,520 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:31:29,556 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 24, 22, 885860, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
...
2023-04-28 13:48:29,817 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-04-28 13:48:29,854 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'q8': {'c6i-4xlarge-8cpu-32gb': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 4, 28, 13, 38, 23, 342011, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
2023-04-28 13:48:29,854 - [root:_reset_insufficient_capacity_timeout_expired_nodes] - INFO - Reset the following compute resources because insufficient capacity timeout expired: {'q8': ['c6i-4xlarge-8cpu-32gb']}
2023-04-28 13:48:30,041 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2023-04-28 13:49:23,694 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-04-28 13:49:23,696 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-04-28 13:49:23,936 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-04-28 13:49:23,936 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
answered a year ago

Log file: /var/log/parallelcluster/slurm_resume.log:

2023-04-28 13:23:57,893 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2023-04-28 13:23:57,902 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2023-04-28 13:23:57,902 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - Updating DNS records for Z08130203IL2A890KXN0Z - prod-slurm.pcluster.
2023-04-28 13:23:58,473 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2023-04-28 13:23:58,474 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:23:58,474 - [slurm_plugin.fleet_manager:run_instances] - INFO - Launching instances with run_instances API. Parameters: {'MinCount': 1, 'MaxCount': 1, 'LaunchTemplate': {'LaunchTemplateName': 'prod-slurm-q8-c6i-4xlarge-8cpu-32gb', 'Version': '$Latest'}}
2023-04-28 13:24:00,121 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Failed RunInstances request: a9e86274-b8b2-43b1-8598-b6ea9e78f315
2023-04-28 13:24:00,121 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient c6i.4xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get c6i.4xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c.
2023-04-28 13:24:00,122 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x2) ['q2-dy-c6i-xlarge-2cpu-8gb-1', 'q4-dy-c6i-2xlarge-4cpu-16gb-1']
2023-04-28 13:24:00,122 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:24:00,122 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state: (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:24:00,141 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2023-04-28 13:37:53,144 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-04-28 13:37:53,144 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-04-28 13:37:53,145 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='eu-central-1', cluster_name='prod-slurm', dynamodb_table='parallelcluster-slurm-prod-slurm', hosted_zone='Z08130203IL2A890KXN0Z', dns_domain='prod-slurm.pcluster.', use_private_hostname=False, head_node_private_ip='172.31.30.213', head_node_hostname='ip-172-31-30-213.eu-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, fleet_config={'q1': {'c6i-large-1cpu-4gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.large'}]}}, 'q2': {'c6i-xlarge-2cpu-8gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.xlarge'}]}}, 'q4': {'c6i-2xlarge-4cpu-16gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.2xlarge'}]}}, 'q8': {'c6i-4xlarge-8cpu-32gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.4xlarge'}]}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7fc2ae177d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2023-04-28 13:37:53,145 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='eu-central-1', cluster_name='prod-slurm', dynamodb_table='parallelcluster-slurm-prod-slurm', hosted_zone='Z08130203IL2A890KXN0Z', dns_domain='prod-slurm.pcluster.', use_private_hostname=False, head_node_private_ip='172.31.30.213', head_node_hostname='ip-172-31-30-213.eu-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, fleet_config={'q1': {'c6i-large-1cpu-4gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.large'}]}}, 'q2': {'c6i-xlarge-2cpu-8gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.xlarge'}]}}, 'q4': {'c6i-2xlarge-4cpu-16gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.2xlarge'}]}}, 'q8': {'c6i-4xlarge-8cpu-32gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.4xlarge'}]}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7fc2ae177d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2023-04-28 13:37:53,149 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-04-28 13:37:29.461969+00:00
2023-04-28 13:37:53,149 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q8-dy-c6i-4xlarge-8cpu-32gb-1
2023-04-28 13:37:53,241 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: prod-slurm-RoleHeadNode-17PZU4MOXD3MA
2023-04-28 13:37:53,274 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:37:53,274 - [slurm_plugin.fleet_manager:run_instances] - INFO - Launching instances with run_instances API. Parameters: {'MinCount': 1, 'MaxCount': 1, 'LaunchTemplate': {'LaunchTemplateName': 'prod-slurm-q8-c6i-4xlarge-8cpu-32gb', 'Version': '$Latest'}}
2023-04-28 13:37:54,826 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Failed RunInstances request: 8c47735f-1a24-420d-acab-d6b109cdadca
2023-04-28 13:37:54,827 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient c6i.4xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get c6i.4xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c.
2023-04-28 13:37:54,827 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
2023-04-28 13:37:54,827 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:37:54,827 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state: (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:37:54,845 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2023-04-28 13:51:53,225 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-04-28 13:51:53,226 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-04-28 13:51:53,226 - [slurm_plugin.resume:_get_config] - INFO - SlurmResumeConfig(region='eu-central-1', cluster_name='prod-slurm', dynamodb_table='parallelcluster-slurm-prod-slurm', hosted_zone='Z08130203IL2A890KXN0Z', dns_domain='prod-slurm.pcluster.', use_private_hostname=False, head_node_private_ip='172.31.30.213', head_node_hostname='ip-172-31-30-213.eu-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, fleet_config={'q1': {'c6i-large-1cpu-4gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.large'}]}}, 'q2': {'c6i-xlarge-2cpu-8gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.xlarge'}]}}, 'q4': {'c6i-2xlarge-4cpu-16gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.2xlarge'}]}}, 'q8': {'c6i-4xlarge-8cpu-32gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.4xlarge'}]}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f497533bd60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2023-04-28 13:51:53,227 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='eu-central-1', cluster_name='prod-slurm', dynamodb_table='parallelcluster-slurm-prod-slurm', hosted_zone='Z08130203IL2A890KXN0Z', dns_domain='prod-slurm.pcluster.', use_private_hostname=False, head_node_private_ip='172.31.30.213', head_node_hostname='ip-172-31-30-213.eu-central-1.compute.internal', max_batch_size=500, update_node_address=True, all_or_nothing_batch=False, fleet_config={'q1': {'c6i-large-1cpu-4gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.large'}]}}, 'q2': {'c6i-xlarge-2cpu-8gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.xlarge'}]}}, 'q4': {'c6i-2xlarge-4cpu-16gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.2xlarge'}]}}, 'q8': {'c6i-4xlarge-8cpu-32gb': {'Api': 'run-instances', 'Instances': [{'InstanceType': 'c6i.4xlarge'}]}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f497533bd60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf')
2023-04-28 13:51:53,231 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-04-28 13:51:29.978188+00:00
2023-04-28 13:51:53,231 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q8-dy-c6i-4xlarge-8cpu-32gb-1
2023-04-28 13:51:53,324 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: prod-slurm-RoleHeadNode-17PZU4MOXD3MA
2023-04-28 13:51:53,356 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:51:53,356 - [slurm_plugin.fleet_manager:run_instances] - INFO - Launching instances with run_instances API. Parameters: {'MinCount': 1, 'MaxCount': 1, 'LaunchTemplate': {'LaunchTemplateName': 'prod-slurm-q8-c6i-4xlarge-8cpu-32gb', 'Version': '$Latest'}}
2023-04-28 13:51:54,842 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('q8-dy-c6i-4xlarge-8cpu-32gb-1', EC2Instance(id='i-006494fbb2ade8a55', private_ip='172.31.79.226', hostname='ip-172-31-79-226', launch_time=datetime.datetime(2023, 4, 28, 13, 51, 54, tzinfo=tzlocal()), slurm_node=None))"]
2023-04-28 13:51:54,843 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2023-04-28 13:51:54,876 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2023-04-28 13:51:54,876 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - Updating DNS records for Z08130203IL2A890KXN0Z - prod-slurm.pcluster.
2023-04-28 13:51:55,613 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2023-04-28 13:51:55,614 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['q8-dy-c6i-4xlarge-8cpu-32gb-1']
2023-04-28 13:51:55,615 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2023-04-28 15:51:53,856 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
answered a year ago
