I'm trying to enable Slurm accounting for multiple clusters created with AWS ParallelCluster 3, following this guide. I successfully enabled accounting for the first cluster (cluster-one), and now I'm trying to set up the second one (cluster-two) using the approach recommended under "Replicate the process on multiple clusters" on that page, that is, using a single slurmdbd instance.
However, the connection from cluster-two to slurmdbd on cluster-one is not working.
This is from the slurmdbd.log file on cluster-one:
[2022-05-22T15:09:02.965] error: Munge decode failed: Invalid credential
[2022-05-22T15:09:02.966] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: Protocol authentication error
[2022-05-22T15:09:02.976] error: CONN:10 Failed to unpack SLURM_PERSIST_INIT message
This is from the slurmctld.log file on cluster-two:
[2022-05-09T21:39:36.773] error: slurmdbd: Invalid message version=6500, type:1432
[2022-05-09T21:39:37.250] error: auth_g_pack: protocol_version 6500 not supported
[2022-05-09T21:39:37.250] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2022-05-09T21:39:37.250] error: slurm_persist_conn_open: failed to send persistent connection init message to ip-10-0-21-180:6819
[2022-05-09T21:39:37.250] error: Sending PersistInit msg: Protocol authentication error
[2022-05-09T21:39:37.250] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
So I suppose it is an authentication problem related to munge, but how do I solve this?
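Since MUNGE can only decode a credential when both hosts share the same key, one way to test this suspicion is to compare a fingerprint of the key file on the two head nodes. The following is a minimal sketch, not a confirmed fix: it assumes the key lives at the default path /etc/munge/munge.key on both clusters, that it is run as a user allowed to read that file, and the helper names key_fingerprint and keys_match are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumption: default MUNGE key location; adjust if the cluster uses another path.
DEFAULT_KEY_PATH = "/etc/munge/munge.key"


def key_fingerprint(path=DEFAULT_KEY_PATH):
    """Return the SHA-256 hex digest of the key file, so the key itself is never printed."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def keys_match(fingerprint_a, fingerprint_b):
    """Compare two hex digests, ignoring surrounding whitespace and letter case."""
    return fingerprint_a.strip().lower() == fingerprint_b.strip().lower()
```

Running key_fingerprint() on the head node of cluster-one and again on the head node of cluster-two, then passing both digests to keys_match, would show whether the two clusters use different keys. If they differ, that would be consistent with the "Munge decode failed: Invalid credential" error above.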
Works perfectly. May I suggest editing that tutorial to add this information?