Questions tagged with High Performance Compute


2 answers · 0 votes · 40 views · asked a year ago

AWS ParallelCluster compute nodes failing to start properly

Hello, I am a new ParallelCluster 2.11 user and am having an issue where my compute nodes fail to spin up properly, resulting in the eventual failure of `pcluster create`. Here is my config file:

<code>
[aws]
aws_region_name = us-east-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = <my-keypair>
scheduler = slurm
master_instance_type = c5n.2xlarge
base_os = centos7
vpc_settings = default
queue_settings = compute
master_root_volume_size = 1000
compute_root_volume_size = 35

[vpc default]
vpc_id = <my-default-vpc-id>
master_subnet_id = <my-subneta>
compute_subnet_id = <my-subnetb>
use_public_ips = false

[queue compute]
enable_efa = true
compute_resource_settings = default
compute_type = ondemand
placement_group = DYNAMIC
disable_hyperthreading = true

[compute_resource default]
instance_type = c5n.18xlarge
initial_count = 1
min_count = 1
max_count = 32

[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500
</code>

When I run `pcluster create` I get the following error after ~15 min:

<code>
The following resource(s) failed to create: MasterServer.
- AWS::EC2::Instance MasterServer
  Failed to receive 1 resource signal(s) within the specified duration
</code>

If I log into the master node before the failure above, I see the following in /var/log/parallelcluster/clustermgtd:

<code>
2021-09-28 15:42:41,168 - slurm_plugin.clustermgtd:_maintain_nodes - INFO - Found the following unhealthy static nodes: (x1) 'compute-st-c5n18xlarge-1(compute-st-c5n18xlarge-1)'
2021-09-28 15:42:41,168 - slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes - INFO - Setting unhealthy static nodes to DOWN
</code>

However, despite the node being set to DOWN, the EC2 compute instance stays in the running state and the log keeps emitting the following message:

<code>
2021-09-28 15:54:41,156 - slurm_plugin.clustermgtd:_maintain_nodes - INFO - Following nodes are currently in replacement: (x1) 'compute-st-c5n18xlarge-1'
</code>

This state persists until the `pcluster create` command fails with the error noted above. I suspect there is something wrong with my configuration -- any help or further troubleshooting advice would be appreciated.
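A possible first diagnostic step (a minimal sketch, not from the original post): pull the console output of the stuck c5n.18xlarge compute instance while the stack is still creating, to see whether it failed during boot or bootstrap. The instance ID below is a placeholder, and boto3 with working AWS credentials is assumed.

<code>
# Diagnostic sketch (not part of the original post): fetch the console output of the
# stuck compute instance before CloudFormation rolls the stack back, to look for
# boot/bootstrap errors. The instance ID is a placeholder; boto3 and credentials assumed.
import boto3

REGION = "us-east-1"                 # region from the posted config
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: ID of the stuck compute instance

ec2 = boto3.client("ec2", region_name=REGION)
resp = ec2.get_console_output(InstanceId=INSTANCE_ID, Latest=True)
print(resp.get("Output", "<no console output available yet>"))
</code>

If the console output shows errors from cloud-init or the node bootstrap, that usually narrows the problem down faster than the clustermgtd log alone.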
2 answers · 0 votes · 34 views · asked a year ago

How to update cluster config when the original ebs snapshot is gone

Hi, I have a cluster configured with ParallelCluster 2.10 that has been running for over half a year now. It has two EBS resources mounted at /shared and /install. It seems that both of the EBS snapshots associated with these mount points have been deleted. This should not be an issue, since the snapshots are used only for the initialization of the cluster; however, when I try to update the configuration of the cluster now - simply adding some compute nodes (bumping max_queue_size) - I get the following error message:

<code>
(venv_aws) > pcluster update flacscloudHPC-2-10-0 -c ./config_flacscloudHPC
Retrieving configuration from CloudFormation for cluster flacscloudHPC-2-10-0...
Validating configuration file ./config_flacscloudHPC...
WARNING: The configuration parameter 'scheduler' generated the following warnings:
The job scheduler you are using (torque) is scheduled to be deprecated in future releases of ParallelCluster. More information is available here: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster
ERROR: The section [ebs custom2] is wrongly configured
The snapshot snap-0870f8601759ca239 does not appear to exist: The snapshot 'snap-0870f8601759ca239' does not exist.
</code>

How can I update max_queue_size without the original snapshot 'snap-0870f8601759ca239'? Is it safe to forcefully reconfigure the cluster with some updated, existing snapshots?
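Before editing the cluster configuration, it may help to confirm which snapshots still exist. A minimal sketch, assuming boto3 with credentials for the cluster's account and the snapshot ID taken from the error above:

<code>
# Sketch (not from the original post): check whether the snapshot referenced by the
# [ebs custom2] section still exists. boto3 and AWS credentials are assumed.
import boto3
from botocore.exceptions import ClientError

SNAPSHOT_ID = "snap-0870f8601759ca239"  # snapshot ID from the error message above

ec2 = boto3.client("ec2")  # assumes the default region matches the cluster's region
try:
    ec2.describe_snapshots(SnapshotIds=[SNAPSHOT_ID])
    print(f"{SNAPSHOT_ID} still exists")
except ClientError as err:
    if err.response["Error"]["Code"] == "InvalidSnapshot.NotFound":
        print(f"{SNAPSHOT_ID} is gone; the ebs section will need a different (or no) snapshot")
    else:
        raise
</code>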
2 answers · 0 votes · 16 views · asked a year ago

torque service exits with status 3 on master node

Hi, I noticed strange behavior of my cluster. I am using Torque on CentOS 8. The cluster was working fine for over 2 months and all of a sudden the compute nodes stopped running queued jobs. I tried restarting the compute fleet, but this didn't help, and I found out that the torque service on the master node had failed and I am not able to restart it (see the listing below). What can I do to repair my cluster? I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading).

[code]
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
     Docs: man:munged(8)
 Main PID: 5974 (munged)
    Tasks: 4 (limit: 47239)
   Memory: 3.5M
   CGroup: /system.slice/munge.service
           └─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
   Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
   Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 47239)
   Memory: 0B
   CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
   Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-12-01 22:04:04 UTC; 2 months 17 days ago
 Main PID: 6173 (code=exited, status=3)
jobwatcher                       RUNNING   pid 6467, uptime 78 days, 17:53:12
sqswatcher                       RUNNING   pid 6468, uptime 78 days, 17:53:11
● trqauthd.service - TORQUE trqauthd daemon
   Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
 Main PID: 5900 (trqauthd)
    Tasks: 1 (limit: 47239)
   Memory: 3.1M
   CGroup: /system.slice/trqauthd.service
           └─5900 /opt/torque/sbin/trqauthd -F

[centos@ip-172-31-24-41 ~]$ sudo service pbs_server restart
Restarting pbs_server (via systemctl):  [ OK ]

[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
     Docs: man:munged(8)
 Main PID: 5974 (munged)
    Tasks: 4 (limit: 47239)
   Memory: 3.5M
   CGroup: /system.slice/munge.service
           └─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
   Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
   Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 47239)
   Memory: 0B
   CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
   Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-02-18 16:18:48 UTC; 7s ago
  Process: 2884631 ExecStart=/opt/torque/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, status=3)
 Main PID: 2884631 (code=exited, status=3)

Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: Started TORQUE pbs_server daemon.
Feb 18 16:18:48 ip-172-31-24-41 pbs_server[2884631]: pbs_server port already bound: Address already in use
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Failed with result 'exit-code'.

jobwatcher                       RUNNING   pid 6467, uptime 78 days, 18:14:44
sqswatcher                       RUNNING   pid 6468, uptime 78 days, 18:14:43
● trqauthd.service - TORQUE trqauthd daemon
   Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
 Main PID: 5900 (trqauthd)
    Tasks: 1 (limit: 47239)
   Memory: 3.1M
   CGroup: /system.slice/trqauthd.service
           └─5900 /opt/torque/sbin/trqauthd -F
[/code]
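Since the journal reports "pbs_server port already bound: Address already in use", one quick check is whether another process is still holding the pbs_server port on the master node. A minimal sketch, assuming the default pbs_server port 15001 (adjust if your installation uses a different port):

[code]
# Diagnostic sketch (not from the original post): check whether something is still
# listening on the pbs_server port, which would explain the bind failure.
# Port 15001 is an assumption (the usual TORQUE pbs_server default).
import socket

PBS_SERVER_PORT = 15001  # assumed default TORQUE pbs_server port

def port_in_use(host: str = "127.0.0.1", port: int = PBS_SERVER_PORT) -> bool:
    """Return True if a TCP connection to host:port succeeds, i.e. the port is held."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    if port_in_use():
        print("Port 15001 is still held; a stale pbs_server process may need to be stopped first.")
    else:
        print("Port 15001 is free; the bind error likely has another cause.")
[/code]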
2 answers · 0 votes · 60 views · asked 2 years ago

torque nodes overloaded with TSK greater than NP

Hello, I noticed that the nodes in my cluster tend to overcommit and end up overloaded, running more torque jobs than the number of available CPUs. I suspect it may be related to the torque configuration (or maybe it doesn't respect hyperthreading somehow?). I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading). The node I analyze here is **ip-172-31-68-184**.

This is the qnodes output for the node; it should allow up to np=8 CPUs:

[code]
$ qnodes
...
ip-172-31-68-184
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/218.ip-172-31-24-41.eu-central-1.compute.internal,1/219.ip-172-31-24-41.eu-central-1.compute.internal,2/220.ip-172-31-24-41.eu-central-1.compute.internal,3/221.ip-172-31-24-41.eu-central-1.compute.internal,4/518.ip-172-31-24-41.eu-central-1.compute.internal
     status = opsys=linux,uname=Linux ip-172-31-68-184 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020 x86_64,sessions=1182 1306 5674 6030 6039 6046 6062 112846,nsessions=8,nusers=4,idletime=166759,totmem=31720500kb,availmem=29305472kb,physmem=31720500kb,ncpus=8,loadave=18.33,gres=,netload=47638299866,state=free,varattr= ,cpuclock=Fixed,macaddr=02:5a:f2:25:37:ba,version=6.1.2,rectime=1612984963,jobs=218.ip-172-31-24-41.eu-central-1.compute.internal 219.ip-172-31-24-41.eu-central-1.compute.internal 220.ip-172-31-24-41.eu-central-1.compute.internal 221.ip-172-31-24-41.eu-central-1.compute.internal 518.ip-172-31-24-41.eu-central-1.compute.internal
     mom_service_port = 15002
     mom_manager_port = 15003
[/code]

whereas the qstat output for this node is:

[code]
Job ID                  Username    Queue  Jobname  SessID   NDS  TSK  Memory  Time      S  Time
218.ip-172-31-24-41.eu  flacscloud  batch  000038   6030     --   4   --      48:00:00  R  46:13:51  ip-172-31-68-184/0
219.ip-172-31-24-41.eu  flacscloud  batch  000039   6039     --   4   --      48:00:00  R  46:13:51  ip-172-31-68-184/1
220.ip-172-31-24-41.eu  flacscloud  batch  000056   6046     --   4   --      48:00:00  R  46:13:51  ip-172-31-68-184/2
221.ip-172-31-24-41.eu  flacscloud  batch  000060   6062     --   4   --      48:00:00  R  46:13:51  ip-172-31-68-184/3
518.ip-172-31-24-41.eu  flacscloud  batch  012310   112846   --   2   --      48:00:00  R  23:16:18  ip-172-31-68-184/4
[/code]

It is clear that the sum of TSK for the running jobs (4+4+4+4+2 = 18) is greater than the number of CPUs (8). This can be confirmed by running `top` on the node; the node is overloaded. Why does this happen and how can I fix this behavior?
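For reference, the overcommit arithmetic from the listings above, written out as a small sketch (task counts copied from the qstat output; np = 8 from qnodes):

[code]
# Illustration only (not from the original post): sum the TSK column from the qstat
# listing above to show the overcommit on ip-172-31-68-184.
jobs_tsk = {
    "218": 4,
    "219": 4,
    "220": 4,
    "221": 4,
    "518": 2,
}
np_cpus = 8  # np reported by qnodes for ip-172-31-68-184

total_tasks = sum(jobs_tsk.values())
print(f"Total TSK scheduled on the node: {total_tasks}")   # 18
print(f"Overcommit factor: {total_tasks / np_cpus:.2f}x")  # 2.25x
[/code]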
3 answers · 0 votes · 5 views · asked 2 years ago