
ParallelCluster + Snakemake: node not responding


Hi guys! I am trying to automate my workflow with Snakemake. I have multiple jobs to run. One job was assigned to a node and finished successfully. Then another job was assigned to the same node, and it failed with the error below. Submitting multiple jobs manually does not have this issue: when a node finishes a job and is assigned another one, it runs without problems. Why?

agent/is_node_resp: node:queue-1-dy-queue-1-cr-2-51 RPC:REQUEST_PING : Socket timed out on send/recv operation.

Best, Paca

asked 17 days ago · 18 views
1 Answer

This error indicates a communication timeout between the Slurm controller and a compute node in your ParallelCluster environment. When you see "Socket timed out on send/recv operation" with "RPC:REQUEST_PING", it means Slurm tried to ping the node but couldn't get a response.

In AWS ParallelCluster with Slurm scheduler, when a job finishes on a dynamic node, the node typically enters a POWER_DOWN state after the configured idle time (scaledown_idletime) has passed. The node is then terminated and reset back to POWER_SAVING state for future use.
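You can see these transitions directly in Slurm: sinfo appends a suffix to the node state for power-managed nodes. A rough, annotated sketch (the node and partition names are taken from your error message for illustration; your actual output will differ):

    $ sinfo -N -o "%N %P %t"
    queue-1-dy-queue-1-cr-2-51  queue-1  idle%    # powering down after scaledown_idletime
    queue-1-dy-queue-1-cr-2-52  queue-1  idle~    # powered down (POWER_SAVING), resumed on demand
    queue-1-dy-queue-1-cr-2-53  queue-1  idle*    # not responding, which is what the REQUEST_PING timeout reports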

When Snakemake submits a second job to the same node that just completed a job, there might be timing issues where:

  1. The node might be in the process of transitioning to POWER_DOWN state
  2. The node might have already been terminated but not yet fully reset to POWER_SAVING state
  3. The node might be experiencing health issues that weren't apparent during the first job

Unlike manual job submissions where you might naturally introduce delays between jobs, Snakemake's automated workflow might be submitting jobs in rapid succession without allowing enough time for the node state transitions to complete properly.
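If rapid-fire submission is the trigger, you can throttle Snakemake's submission rate instead of patching delays into the workflow itself. A minimal sketch (the profile name "slurm" is only an assumption for whatever cluster profile you use; the rate-limit flags exist in recent Snakemake releases):

    snakemake --profile slurm --jobs 50 --max-jobs-per-second 1 --max-status-checks-per-second 1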

To resolve this issue:

  1. Check the node states with sinfo to see whether nodes are transitioning between states properly (the state suffixes are shown in the sketch above)
  2. Consider increasing the scaledown_idletime parameter to keep nodes available longer (see the note after this list)
  3. Add a small delay between job submissions in your Snakemake workflow, for example by throttling the submission rate as sketched above
  4. Check the relevant logs (ResumeProgram, SuspendProgram, clustermgtd, slurmctld) for more specific error information (see the commands after this list)
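For step 2, note that in ParallelCluster 3 the setting is called ScaledownIdletime (in minutes, default 10) under Scheduling / SlurmSettings in the cluster configuration YAML; scaledown_idletime is the ParallelCluster 2 name. For step 4, the logs live on the head node; a sketch assuming the default paths (adjust if your installation differs):

    # On the head node
    sudo tail -n 100 /var/log/parallelcluster/clustermgtd
    sudo tail -n 100 /var/log/parallelcluster/slurm_resume.log    # ResumeProgram
    sudo tail -n 100 /var/log/parallelcluster/slurm_suspend.log   # SuspendProgram
    sudo tail -n 100 /var/log/slurmctld.log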

If the issue persists, you might need to investigate whether there are specific node failures occurring by examining the AWS ParallelCluster logs.
Sources
Slurm guide for multiple queue mode - AWS ParallelCluster

answered 17 days ago
