
ParallelCluster Snakemake node not responding


Hi guys! I am trying to automate my workflow with Snakemake and have multiple jobs to run. One job was assigned to a node and finished successfully. Then another job was assigned to the same node, and that job failed with the error below. Submitting multiple jobs manually does not have this issue (when a node finishes a job and is assigned another one, it runs without problems). Why?

agent/is_node_resp: node:queue-1-dy-queue-1-cr-2-51 RPC:REQUEST_PING : Socket timed out on send/recv operation.

Best, Paca

Asked 2 months ago · Viewed 31 times
1 Answer

This error indicates a communication timeout between the Slurm controller and a compute node in your ParallelCluster environment. When you see "Socket timed out on send/recv operation" with "RPC:REQUEST_PING", it means Slurm tried to ping the node but couldn't get a response.

In AWS ParallelCluster with the Slurm scheduler, when a job finishes on a dynamic node, the node typically enters a POWER_DOWN state after the configured idle time (scaledown_idletime) has passed. The node is then terminated and reset back to the POWER_SAVING state for future use.
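As a sketch of where that idle timeout lives: in a ParallelCluster 3 cluster configuration it is the `ScaledownIdletime` parameter (in minutes) under `Scheduling`/`SlurmSettings`. The queue and compute-resource names below are placeholders, not taken from your cluster:

```yaml
# Hypothetical excerpt of a ParallelCluster 3 cluster config;
# queue/resource names and instance type are placeholders.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10   # minutes a dynamic node may sit idle before POWER_DOWN
  SlurmQueues:
    - Name: queue-1
      ComputeResources:
        - Name: queue-1-cr-2
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
```

(In ParallelCluster 2 the equivalent setting is `scaledown_idletime` in the `[cluster]` section of the INI-style config.)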

When Snakemake submits a second job to the same node that just completed a job, there might be timing issues where:

  1. The node might be in the process of transitioning to POWER_DOWN state
  2. The node might have already been terminated but not yet fully reset to POWER_SAVING state
  3. The node might be experiencing health issues that weren't apparent during the first job

Unlike manual job submissions where you might naturally introduce delays between jobs, Snakemake's automated workflow might be submitting jobs in rapid succession without allowing enough time for the node state transitions to complete properly.
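One way to mimic those natural delays, assuming a reasonably recent Snakemake driven through a Slurm cluster profile (the profile name `slurm-profile` here is a placeholder for your own), is Snakemake's built-in rate limits:

```shell
# Throttle how fast Snakemake submits jobs to Slurm and polls their status;
# "slurm-profile" is a hypothetical profile name.
snakemake --profile slurm-profile --jobs 50 \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1 \
    --latency-wait 60
```

`--max-jobs-per-second` spaces out submissions, `--max-status-checks-per-second` eases the polling load on slurmctld, and `--latency-wait` tolerates output files appearing slowly on a shared filesystem.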

To resolve this issue:

  1. Check the node states using sinfo to see if nodes are properly transitioning between states
  2. Consider increasing the scaledown_idletime parameter to keep nodes available longer
  3. Add a small delay between job submissions in your Snakemake workflow
  4. Check the relevant logs (ResumeProgram, SuspendProgram, clustermgtd, slurmctld) for more specific error information
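Concretely, steps 1 and 4 might look like this on the head node. The node name comes from the error message above; the log paths are the usual ParallelCluster 3 defaults and may differ in your version:

```shell
# Step 1: inspect node states. Slurm's state suffixes show the power cycle:
#   idle~ = powered down, idle# = powering up, idle% = powering down,
#   a trailing * means the node is not responding.
sinfo --Node --long
scontrol show node queue-1-dy-queue-1-cr-2-51

# Step 4: search the scaling and controller logs for that node
grep queue-1-dy-queue-1-cr-2-51 /var/log/parallelcluster/clustermgtd
grep queue-1-dy-queue-1-cr-2-51 /var/log/parallelcluster/slurm_resume.log
grep queue-1-dy-queue-1-cr-2-51 /var/log/parallelcluster/slurm_suspend.log
grep queue-1-dy-queue-1-cr-2-51 /var/log/slurmctld.log
```

If the node shows `idle%` or a `*` suffix when the second job lands on it, that matches the state-transition race described above.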

If the issue persists, you might need to investigate whether there are specific node failures occurring by examining the AWS ParallelCluster logs.
Sources
Slurm guide for multiple queue mode - AWS ParallelCluster

Answered 2 months ago
