torque service exits with status 3 on master node

Hi,
I noticed strange behavior on my cluster. I am using Torque on CentOS 8. The cluster worked fine for over 2 months, and then all of a sudden the compute nodes stopped running queued jobs. I tried restarting the compute fleet, but this didn't help. I then found that the Torque service (pbs_server) on the master node had failed and that I am not able to restart it (see the listing below). What can I do to repair my cluster?
I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading).
[code]
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Docs: man:munged(8)
Main PID: 5974 (munged)
Tasks: 4 (limit: 47239)
Memory: 3.5M
CGroup: /system.slice/munge.service
└─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
Docs: man:systemd-sysv-generator(8)
Tasks: 0 (limit: 47239)
Memory: 0B
CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-12-01 22:04:04 UTC; 2 months 17 days ago
Main PID: 6173 (code=exited, status=3)
jobwatcher RUNNING pid 6467, uptime 78 days, 17:53:12
sqswatcher RUNNING pid 6468, uptime 78 days, 17:53:11
● trqauthd.service - TORQUE trqauthd daemon
Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Main PID: 5900 (trqauthd)
Tasks: 1 (limit: 47239)
Memory: 3.1M
CGroup: /system.slice/trqauthd.service
└─5900 /opt/torque/sbin/trqauthd -F
[centos@ip-172-31-24-41 ~]$ sudo service pbs_server restart
Restarting pbs_server (via systemctl): [ OK ]
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Docs: man:munged(8)
Main PID: 5974 (munged)
Tasks: 4 (limit: 47239)
Memory: 3.5M
CGroup: /system.slice/munge.service
└─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
Docs: man:systemd-sysv-generator(8)
Tasks: 0 (limit: 47239)
Memory: 0B
CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-02-18 16:18:48 UTC; 7s ago
Process: 2884631 ExecStart=/opt/torque/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, status=3)
Main PID: 2884631 (code=exited, status=3)

Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: Started TORQUE pbs_server daemon.
Feb 18 16:18:48 ip-172-31-24-41 pbs_server[2884631]: pbs_server port already bound: Address already in use
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Failed with result 'exit-code'.
jobwatcher RUNNING pid 6467, uptime 78 days, 18:14:44
sqswatcher RUNNING pid 6468, uptime 78 days, 18:14:43
● trqauthd.service - TORQUE trqauthd daemon
Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Main PID: 5900 (trqauthd)
Tasks: 1 (limit: 47239)
Memory: 3.1M
CGroup: /system.slice/trqauthd.service
└─5900 /opt/torque/sbin/trqauthd -F

[/code]

2 Answers

Hi @mfolusiak

It's not clear to me why the Torque service crashed.
The Torque logs may contain more information:
[code]
/var/spool/torque/client_logs/*
/var/spool/torque/server_logs/*
[/code]
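
As a rough sketch, the most recently modified file in one of these directories can be picked out like this. The helper name `newest_file` is made up for illustration, and the approach assumes log file names contain no newlines:

```shell
# Print the name of the most recently modified entry in the given directory.
# Hypothetical helper, not part of Torque.
newest_file() {
    ls -t -- "$1" 2>/dev/null | head -n 1
}

# Example use (may need sudo depending on directory permissions):
# tail -n 50 "/var/spool/torque/server_logs/$(newest_file /var/spool/torque/server_logs)"
```

Looking at the tail of the newest server log around the time of the failure is usually a reasonable first step.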

Anyway, according to the log message, pbs_server cannot start because its port is already in use by another process:
[code]
pbs_server port already bound: Address already in use
[/code]
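
To see which process is holding the port, something like the following might help. It is only a sketch: it assumes pbs_server is on its usual default port 15001 (your installation may be configured differently), and `listeners_on_port` is a made-up helper that filters the output of `ss -ltnp`:

```shell
# Filter `ss -ltnp` output (read from stdin) down to the listening sockets
# whose local address ends in the given port. Hypothetical helper.
listeners_on_port() {
    awk -v port="$1" '$4 ~ (":" port "$")'
}

# Assumed default pbs_server port; change it if your site overrides it.
PBS_PORT=15001

# Example use (sudo is needed to see the owning process of other users' sockets):
# sudo ss -ltnp | listeners_on_port "$PBS_PORT"
```

If a line comes back, its last column names the process and PID that own the socket, which is likely a leftover pbs_server instance.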

I'd suggest stopping all the Torque-related daemons:
[code]
/etc/init.d/trqauthd stop
/etc/init.d/pbs_server stop
/etc/init.d/pbs_sched stop
[/code]

Make sure no related processes are still running, and kill any that remain:
[code]
ps aux | grep trq
ps aux | grep pbs
[/code]
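
A slightly stricter variant of this check, sketched below, matches the daemon names exactly, so the `grep` process itself and unrelated commands that merely contain "pbs" do not show up. The helper name `torque_daemons` is made up, and the list of daemon names is just the three mentioned in this thread:

```shell
# Filter `ps -eo pid,comm` output (read from stdin) down to the Torque
# daemons named in this thread, printing "PID NAME" for each one found.
# Hypothetical helper, not part of Torque.
torque_daemons() {
    awk '$2 == "pbs_server" || $2 == "pbs_sched" || $2 == "trqauthd" { print $1, $2 }'
}

# Example use:
# ps -eo pid,comm | torque_daemons
```

Any PID it prints after the daemons have been stopped is a leftover process that can be terminated with `sudo kill <PID>` before starting the services again.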

Then restart all the daemons:
[code]
/etc/init.d/trqauthd start
/etc/init.d/pbs_server start
/etc/init.d/pbs_sched start
[/code]

Let us know if it helps,

Enrico

AWS
answered 3 years ago

Thank you for your help. Yes, restarting the service and the compute fleet helped.

answered 3 years ago
