
ParallelCluster Snakemake node no response


Hi guys! I am trying to automate my workflow with Snakemake, and I have multiple jobs to run. One job was assigned to a node and finished successfully. Then another job was assigned to the same node, and it failed with the error below. Submitting multiple jobs manually does not have this issue (when a node finishes a job and is assigned another one, it runs without issue). Why?

agent/is_node_resp: node:queue-1-dy-queue-1-cr-2-51 RPC:REQUEST_PING : Socket timed out on send/recv operation.

Best, Paca

Asked a month ago · 28 views
1 answer

This error indicates a communication timeout between the Slurm controller and a compute node in your ParallelCluster environment. When you see "Socket timed out on send/recv operation" with "RPC:REQUEST_PING", it means Slurm tried to ping the node but couldn't get a response.

In AWS ParallelCluster with Slurm scheduler, when a job finishes on a dynamic node, the node typically enters a POWER_DOWN state after the configured idle time (scaledown_idletime) has passed. The node is then terminated and reset back to POWER_SAVING state for future use.
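For reference, that idle timeout is set in the cluster configuration. A minimal sketch of the relevant section of a ParallelCluster 3 config is below; the queue name and compute resource name are chosen to match the node name in your error message, and the instance type and counts are placeholders:

```yaml
# ParallelCluster 3 config fragment (hypothetical values).
# ScaledownIdletime is the number of minutes a dynamic node may sit
# idle before Slurm powers it down (the default is 10).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10
  SlurmQueues:
    - Name: queue-1
      ComputeResources:
        - Name: queue-1-cr-2
          InstanceType: c5.xlarge   # placeholder
          MinCount: 0
          MaxCount: 10
```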

When Snakemake submits a second job to the same node that just completed a job, there might be timing issues where:

  1. The node might be in the process of transitioning to POWER_DOWN state
  2. The node might have already been terminated but not yet fully reset to POWER_SAVING state
  3. The node might be experiencing health issues that weren't apparent during the first job

Unlike manual job submissions, where you naturally introduce delays between jobs, Snakemake's automated workflow may submit jobs in rapid succession without allowing enough time for the node state transitions to complete.
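As a sketch, Snakemake exposes command-line flags to throttle how quickly it submits jobs and polls their status; slowing these down can give the scheduler room to finish its state transitions. Flag availability varies by Snakemake version, so check `snakemake --help` before relying on them:

```shell
# Throttle Snakemake's submission and status-check rates (sketch;
# verify these flags exist in your Snakemake version).
snakemake --profile slurm \
    --jobs 50 \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1 \
    --latency-wait 60
```

`--latency-wait` additionally tolerates delays in output files appearing on the shared filesystem, which is a common companion issue on cluster deployments.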

To resolve this issue:

  1. Check the node states using sinfo to see if nodes are properly transitioning between states
  2. Consider increasing the scaledown_idletime parameter to keep nodes available longer
  3. Add a small delay between job submissions in your Snakemake workflow
  4. Check the relevant logs (ResumeProgram, SuspendProgram, clustermgtd, slurmctld) for more specific error information
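The checks above can be run from the head node. A sketch, using the node name from your error message and the log paths documented for ParallelCluster 3:

```shell
# 1. Inspect node states; look for nodes stuck in DOWN, DRAIN,
#    or powering-down states.
sinfo -N -l
scontrol show node queue-1-dy-queue-1-cr-2-51

# 2. Tail the relevant logs on the head node.
sudo tail -n 100 /var/log/parallelcluster/clustermgtd
sudo tail -n 100 /var/log/parallelcluster/slurm_resume.log   # ResumeProgram
sudo tail -n 100 /var/log/parallelcluster/slurm_suspend.log  # SuspendProgram
sudo tail -n 100 /var/log/slurmctld.log
```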

If the issue persists, you might need to investigate whether there are specific node failures occurring by examining the AWS ParallelCluster logs.
Sources
Slurm guide for multiple queue mode - AWS ParallelCluster

Answered a month ago
