AWS ParallelCluster: slurmdbd for multiple clusters


I'm trying to enable Slurm accounting for multiple clusters created with AWS ParallelCluster 3, following this guide. I successfully enabled accounting for the first cluster (cluster-one), and now I'm trying to set up the second one (cluster-two) the way the guide recommends under "Replicate the process on multiple clusters", that is, by using a single slurmdbd instance.

However, cluster-two cannot connect to the slurmdbd instance running on cluster-one. This is from the slurmdbd.log file on cluster-one:

[2022-05-22T15:09:02.965] error: Munge decode failed: Invalid credential
[2022-05-22T15:09:02.966] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: auth_g_verify: REQUEST_PERSIST_INIT has authentication error: Unspecified error
[2022-05-22T15:09:02.966] error: slurm_unpack_received_msg: Protocol authentication error
[2022-05-22T15:09:02.976] error: CONN:10 Failed to unpack SLURM_PERSIST_INIT message

This is from slurmctld.log on cluster-two:

[2022-05-09T21:39:36.773] error: slurmdbd: Invalid message version=6500, type:1432
[2022-05-09T21:39:37.250] error: auth_g_pack: protocol_version 6500 not supported
[2022-05-09T21:39:37.250] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2022-05-09T21:39:37.250] error: slurm_persist_conn_open: failed to send persistent connection init message to ip-10-0-21-180:6819
[2022-05-09T21:39:37.250] error: Sending PersistInit msg: Protocol authentication error
[2022-05-09T21:39:37.250] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error

So I suppose it is an authentication problem related to munge, but how do I solve this?

asked 2 years ago · 703 views
1 Answer
Accepted Answer

Hi, to enable communication in a federation of Slurm clusters, all clusters have to use the same munge key.

ParallelCluster generates a random munge key for each new cluster, so you need to make the two clusters use the same one. Take the key from the first cluster (the /etc/munge/munge.key file) and replicate it on the other cluster with the right permissions.
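To confirm that the copy worked, it can help to compare a fingerprint of the key on both head nodes and to check the file mode. The following is a minimal Python sketch, not part of ParallelCluster or munge: the helper names `key_fingerprint` and `permission_problems` are hypothetical, and it assumes that "the right permissions" means no group or other access, which munged normally enforces.

```python
import hashlib
import os
import stat


def key_fingerprint(path):
    """Return the SHA-256 hex digest of the key file at `path`."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def permission_problems(path):
    """Return a list of human-readable problems with the key file's mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    # Assumption: any group/other permission bit makes the key insecure.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            f"{path} is accessible by group/other (mode {mode:04o})"
        )
    return problems


# Example use on a head node (run as root, since the key is not world-readable):
#   print(key_fingerprint("/etc/munge/munge.key"))
#   print(permission_problems("/etc/munge/munge.key"))
```

If the digests printed on cluster-one and cluster-two differ, the key was not replicated correctly. Printing a digest rather than the key itself avoids leaking the secret into a terminal log.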

After copying the munge key, restart the daemons on the head node (both munged and slurmctld).
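The restart step could be scripted roughly as below. This is only a sketch under stated assumptions: the systemd unit names `munge` and `slurmctld` are assumptions that should be checked on your AMI, the function names are hypothetical, and restarting munged first is a choice made here so that slurmctld starts against the new key.

```python
import subprocess

# Assumed systemd unit names; verify them on the head node before use.
RESTART_ORDER = ("munge", "slurmctld")


def restart_commands(units=RESTART_ORDER):
    """Build one `systemctl restart` command per unit, in order."""
    return [["systemctl", "restart", unit] for unit in units]


def restart_daemons(dry_run=True):
    """Print the commands by default; run them only when dry_run is False."""
    for cmd in restart_commands():
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError if a restart fails.
            subprocess.run(cmd, check=True)
```

Calling `restart_daemons()` only prints the planned commands; `restart_daemons(dry_run=False)` must be run as root on the head node.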

AWS
answered 2 years ago
  • Works perfectly. May I suggest editing that tutorial to add this bit of information?
