AWS ParallelCluster: slurmdbd for multiple clusters


I'm trying to enable Slurm accounting for multiple clusters created with AWS ParallelCluster 3, following this guide. I successfully enabled accounting for the first cluster (cluster-one), and now I'm trying to set up the second one (cluster-two) using the recommended approach under "Replicate the process on multiple clusters" on that page, that is, sharing a single slurmdbd instance.
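
For context, the accounting settings in cluster-two's slurm.conf point at the slurmdbd running on cluster-one's head node, roughly along these lines (the host name is the one that appears in the logs below; your exact parameters may differ):

JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-10-0-21-180
AccountingStoragePort=6819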

However, cluster-two cannot connect to the slurmdbd instance on cluster-one. This is from the slurmdbd.log file on cluster-one:

[2022-05-22T15:09:02.965] error: Munge decode failed: Invalid credential
[2022-05-22T15:09:02.966] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: Protocol authentication error
[2022-05-22T15:09:02.976] error: CONN:10 Failed to unpack SLURM_PERSIST_INIT message

This is from slurmctld.log on cluster-two:

[2022-05-09T21:39:36.773] error: slurmdbd: Invalid message version=6500, type:1432
[2022-05-09T21:39:37.250] error: auth_g_pack: protocol_version 6500 not supported
[2022-05-09T21:39:37.250] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2022-05-09T21:39:37.250] error: slurm_persist_conn_open: failed to send persistent connection init message to ip-10-0-21-180:6819
[2022-05-09T21:39:37.250] error: Sending PersistInit msg: Protocol authentication error
[2022-05-09T21:39:37.250] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error

So I suppose it is an authentication problem related to munge, but how do I solve this?

Asked 2 years ago · Viewed 703 times
1 Answer
Accepted Answer

Hi, to enable communication in a federation of Slurm clusters you have to use the same Munge key.

ParallelCluster generates a random Munge key for each new cluster, so you need to use the same key on both clusters. Take the key from the first cluster (the /etc/munge/munge.key file) and replicate it on the other cluster, with the right ownership and permissions.
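
For example, once you have transferred the key to cluster-two's head node by whatever secure means you prefer (the staging path /tmp/munge.key below is just a placeholder), install it with the ownership and permissions munged expects:

# On the head node of cluster-two, as root
cp /tmp/munge.key /etc/munge/munge.key    # key copied from cluster-one
chown munge:munge /etc/munge/munge.key    # munged requires the key to be owned by the munge user
chmod 0600 /etc/munge/munge.key           # and not readable by group or others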

After copying the Munge key, you need to restart the daemons on the head node (both munged and slurmctld).
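
A minimal sketch, assuming the systemd unit names used on the ParallelCluster AMIs (they may differ on your base OS):

# On the head node of cluster-two
sudo systemctl restart munge
sudo systemctl restart slurmctld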

AWS
Answered 2 years ago
  • Works perfectly. May I suggest editing that tutorial to add this bit of information?
