torque service exits with status 3 on master node

Hi,
I noticed strange behavior on my cluster. I am using Torque on CentOS 8. The cluster had been working fine for over 2 months when, all of a sudden, the compute nodes stopped running queued jobs. I tried restarting the compute fleet, but this didn't help. I then found that the torque service on the master node had failed, and I am not able to restart it (see listing below). What can I do to repair my cluster?
I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading).
[code]
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Docs: man:munged(8)
Main PID: 5974 (munged)
Tasks: 4 (limit: 47239)
Memory: 3.5M
CGroup: /system.slice/munge.service
└─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
Docs: man:systemd-sysv-generator(8)
Tasks: 0 (limit: 47239)
Memory: 0B
CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-12-01 22:04:04 UTC; 2 months 17 days ago
Main PID: 6173 (code=exited, status=3)
jobwatcher RUNNING pid 6467, uptime 78 days, 17:53:12
sqswatcher RUNNING pid 6468, uptime 78 days, 17:53:11
● trqauthd.service - TORQUE trqauthd daemon
Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Main PID: 5900 (trqauthd)
Tasks: 1 (limit: 47239)
Memory: 3.1M
CGroup: /system.slice/trqauthd.service
└─5900 /opt/torque/sbin/trqauthd -F
[centos@ip-172-31-24-41 ~]$ sudo service pbs_server restart
Restarting pbs_server (via systemctl): [ OK ]
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Docs: man:munged(8)
Main PID: 5974 (munged)
Tasks: 4 (limit: 47239)
Memory: 3.5M
CGroup: /system.slice/munge.service
└─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
Docs: man:systemd-sysv-generator(8)
Tasks: 0 (limit: 47239)
Memory: 0B
CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-02-18 16:18:48 UTC; 7s ago
Process: 2884631 ExecStart=/opt/torque/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, status=3)
Main PID: 2884631 (code=exited, status=3)

Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: Started TORQUE pbs_server daemon.
Feb 18 16:18:48 ip-172-31-24-41 pbs_server[2884631]: pbs_server port already bound: Address already in use
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Failed with result 'exit-code'.
jobwatcher RUNNING pid 6467, uptime 78 days, 18:14:44
sqswatcher RUNNING pid 6468, uptime 78 days, 18:14:43
● trqauthd.service - TORQUE trqauthd daemon
Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
Main PID: 5900 (trqauthd)
Tasks: 1 (limit: 47239)
Memory: 3.1M
CGroup: /system.slice/trqauthd.service
└─5900 /opt/torque/sbin/trqauthd -F

[/code]

asked 3 years ago · 326 views
2 Answers

Hi @mfolusiak

It's not clear to me why the torque service crashed.
The torque logs may contain more information:
[code]
/var/spool/torque/client_logs/*
/var/spool/torque/server_logs/*
[/code]
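
To look at the most recent entries quickly, something like the following could work. It is only a sketch: it assumes the daily log files in those directories are named by date (YYYYMMDD), so the last name in sorted order is the newest. The helper names `latest_log` and `show_recent_server_log` are made up for this example.

```shell
# Assumption: each daily TORQUE log file is named YYYYMMDD, so the
# lexicographically last file name is the most recent log.
latest_log() {
    dir=$1
    newest=$(ls -1 "$dir" 2>/dev/null | sort | tail -n 1)
    # Print the full path only if the directory contained a file.
    [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Print the last lines (default 50) of the newest server log, if any.
show_recent_server_log() {
    log=$(latest_log "${1:-/var/spool/torque/server_logs}") || return 1
    tail -n "${2:-50}" "$log"
}
```

Running `sudo sh -c '. ./helpers.sh; show_recent_server_log'` (with the functions saved in a hypothetical `helpers.sh`) would then show the tail of the latest server log. Root is probably needed to read the spool directory.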

In any case, according to the log message, pbs_server cannot start because its port is already in use by another process:
[code]
pbs_server port already bound: Address already in use
[/code]
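
To see which process is holding the port, you could filter the output of `ss`. This is a sketch under one assumption: that pbs_server listens on TORQUE's default port 15001 (adjust it if your setup differs). `listeners_on_port` is a hypothetical helper, not part of torque.

```shell
# Filter `ss -ltnp` output (read from stdin) down to sockets listening
# on the given port, printing the local address and the last column.
# With root privileges the last column names the owning process.
# Assumption: pbs_server uses the default TORQUE port 15001.
listeners_on_port() {
    awk -v port="${1:-15001}" \
        '$1 == "LISTEN" && $4 ~ (":" port "$") { print $4, $NF }'
}
```

Usage: `sudo ss -ltnp | listeners_on_port 15001`. If a stale pbs_server (or something else) shows up there, that is the process to stop before restarting the service.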

I'd suggest stopping all the torque-related daemons:
[code]
/etc/init.d/trqauthd stop
/etc/init.d/pbs_server stop
/etc/init.d/pbs_sched stop
[/code]

Make sure no torque processes are still running, and kill any that remain:
[code]
ps aux | grep trq
ps aux | grep pbs
[/code]
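
If you prefer exact matches (plain `grep` also matches its own command line), a small filter over `ps` output is one option. This is just a sketch; `torque_pids` is a hypothetical helper, and it only looks for the three daemons mentioned above.

```shell
# Read `ps -eo pid=,comm=` output from stdin and print the PIDs whose
# command name is exactly one of the torque daemons from this thread.
torque_pids() {
    awk '$2 == "pbs_server" || $2 == "pbs_sched" || $2 == "trqauthd" { print $1 }'
}
```

Usage: `ps -eo pid=,comm= | torque_pids`. Any PID it prints after the stop commands have run is a leftover process that can be ended with `sudo kill <pid>`.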

Then restart all the daemons:
[code]
/etc/init.d/trqauthd start
/etc/init.d/pbs_server start
/etc/init.d/pbs_sched start
[/code]

Let us know if that helps,

Enrico

AWS
answered 3 years ago

Thank you for your help. Yes, restarting the services and the compute fleet helped.

answered 3 years ago
