AWS ParallelCluster: slurmdbd for multiple clusters

I'm trying to enable Slurm accounting for multiple clusters created with AWS ParallelCluster 3, following this guide. I successfully enabled accounting for the first cluster (cluster-one), and now I'm trying to set up the second one (cluster-two) the recommended way, described under "Replicate the process on multiple clusters" in that guide, which is to use a single slurmdbd instance.

However, the connection from cluster-two to slurmdbd on cluster-one is not working. This is from the slurmdbd.log file on cluster-one:

[2022-05-22T15:09:02.965] error: Munge decode failed: Invalid credential
[2022-05-22T15:09:02.966] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: Protocol authentication error
[2022-05-22T15:09:02.976] error: CONN:10 Failed to unpack SLURM_PERSIST_INIT message

This is from slurmctld.log on cluster-two:

[2022-05-09T21:39:36.773] error: slurmdbd: Invalid message version=6500, type:1432
[2022-05-09T21:39:37.250] error: auth_g_pack: protocol_version 6500 not supported
[2022-05-09T21:39:37.250] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2022-05-09T21:39:37.250] error: slurm_persist_conn_open: failed to send persistent connection init message to ip-10-0-21-180:6819
[2022-05-09T21:39:37.250] error: Sending PersistInit msg: Protocol authentication error
[2022-05-09T21:39:37.250] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error

So I suppose it is an authentication problem related to munge, but how do I solve this?

asked 2 years ago · 703 views
1 Answer
Accepted Answer

Hi, to enable communication between multiple Slurm clusters (here, through a shared slurmdbd), all of them must use the same munge key.

ParallelCluster generates a random munge key for each new cluster, so by default the two clusters have different keys. Take the key from the first cluster (the /etc/munge/munge.key file) and copy it to the same path on the other cluster, with the right ownership and permissions (typically owned by the munge user and not accessible to anyone else).
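
One way to confirm that the key was replicated correctly is to compare a digest of the file on each head node and check its permission bits. The following is a minimal Python sketch, not part of ParallelCluster: the helper name `describe_key` is hypothetical, and the check that group and others have no access is an assumption about what munged expects.

```python
import hashlib
import os
import stat


def describe_key(path="/etc/munge/munge.key"):
    """Return (sha256 hex digest, permission bits, is_private) for a key file.

    Hypothetical helper, for illustration only.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    # Keep only the permission bits (e.g. 0o400), dropping the file-type bits.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Assumption: the key must not be accessible to group or others.
    is_private = (mode & 0o077) == 0
    return digest, mode, is_private
```

Running `print(describe_key())` as root on both head nodes should give the same digest if the copy succeeded; the digest can be compared without ever displaying the key itself.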

After copying the munge key, restart the daemons on the head node (both munged and slurmctld).
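
The restart step could be scripted roughly as follows. This is a sketch under assumptions: it presumes the head node uses systemd with units named `munge` and `slurmctld`, which the answer does not state, so check the unit names on your system first. The `run` parameter is a hypothetical hook that allows a dry run.

```python
import subprocess

# Assumed systemd unit names; verify them with `systemctl list-units`.
SERVICES = ("munge", "slurmctld")


def restart_services(services=SERVICES, run=subprocess.run):
    """Restart each service in order, stopping at the first failure."""
    for name in services:
        # check=True raises CalledProcessError if systemctl exits non-zero.
        run(["systemctl", "restart", name], check=True)
```

Restarting munge first means slurmctld comes back up against a munged that has already loaded the new key.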

AWS
answered 2 years ago
  • Works perfectly. May I suggest editing that tutorial to add this information?
