AWS ParallelCluster: slurmdbd for multiple clusters


I'm trying to enable Slurm accounting for multiple clusters created with AWS ParallelCluster 3, following this guide. I successfully enabled accounting for the first cluster (cluster-one), and I'm now trying to set up the second one (cluster-two) the recommended way, described under "Replicate the process on multiple clusters" on that page: using a single slurmdbd instance.

However, cluster-two cannot connect to the slurmdbd instance running on cluster-one. This is from the slurmdbd.log file on cluster-one:

[2022-05-22T15:09:02.965] error: Munge decode failed: Invalid credential
[2022-05-22T15:09:02.966] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: Protocol authentication error
[2022-05-22T15:09:02.976] error: CONN:10 Failed to unpack SLURM_PERSIST_INIT message

This is from slurmctld.log on cluster-two:

[2022-05-09T21:39:36.773] error: slurmdbd: Invalid message version=6500, type:1432
[2022-05-09T21:39:37.250] error: auth_g_pack: protocol_version 6500 not supported
[2022-05-09T21:39:37.250] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2022-05-09T21:39:37.250] error: slurm_persist_conn_open: failed to send persistent connection init message to ip-10-0-21-180:6819
[2022-05-09T21:39:37.250] error: Sending PersistInit msg: Protocol authentication error
[2022-05-09T21:39:37.250] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error

So I suppose it is an authentication problem related to munge, but how do I solve this?

asked 2 years ago, 672 views
1 Answer
Accepted Answer

Hi, to enable communication in a federation of Slurm clusters, all clusters have to use the same munge key.

ParallelCluster generates a random munge key for each new cluster, so by default the two clusters have different keys and you need to make them identical. Take the key from the first cluster (the /etc/munge/munge.key file) and copy it to the same path on the other cluster, with the right permissions.

After copying the munge key, restart the daemons on the head node of the cluster that received it (both munged and slurmctld).
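The steps above could look roughly like the following on the cluster-two head node. This is a minimal sketch, not an official procedure: the helper names keys_match and install_munge_key and the staging path /tmp/munge.key.cluster-one are made up for illustration, and the systemd unit names munge and slurmctld are assumptions that may differ on your image.

```shell
#!/bin/sh
# Sketch only: helper names and the staging path are hypothetical.

# keys_match FILE_A FILE_B
# Succeeds (exit 0) only when both key files are byte-for-byte identical.
keys_match() {
    cmp -s "$1" "$2"
}

# install_munge_key SRC [DEST]
# Copies SRC to DEST (default: /etc/munge/munge.key), readable by its owner only.
install_munge_key() {
    src=$1
    dest=${2:-/etc/munge/munge.key}
    install -m 0400 "$src" "$dest"
}

# Intended use as root on the cluster-two head node, once cluster-one's key
# has been copied over a secure channel to /tmp/munge.key.cluster-one:
#
#   install_munge_key /tmp/munge.key.cluster-one
#   chown munge:munge /etc/munge/munge.key
#   systemctl restart munge        # assumed unit name for munged
#   systemctl restart slurmctld    # assumed unit name
#   keys_match /tmp/munge.key.cluster-one /etc/munge/munge.key && echo "keys match"
```

Remove the staged copy of the key once the check passes, since anyone who can read it can forge credentials for both clusters.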

AWS
answered 2 years ago
  • Works perfectly. May I suggest editing that tutorial to add this bit of information?
