This error indicates a communication timeout between the Slurm controller and a compute node in your ParallelCluster environment. When you see "Socket timed out on send/recv operation" with "RPC:REQUEST_PING", it means Slurm tried to ping the node but couldn't get a response.
In AWS ParallelCluster with Slurm scheduler, when a job finishes on a dynamic node, the node typically enters a POWER_DOWN state after the configured idle time (scaledown_idletime) has passed. The node is then terminated and reset back to POWER_SAVING state for future use.
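These power states show up in `sinfo` output as a suffix on the node state (for example `idle~` for POWER_SAVING, `idle%` for POWER_DOWN). As a minimal sketch, assuming the suffix meanings documented in the sinfo man page, the helper below splits a compact state string into its base state and flags. The names `describe_state` and `is_transitioning` are hypothetical helpers, not part of Slurm or ParallelCluster.

```python
from typing import List, Tuple

# Suffixes sinfo appends to a node state (subset, per the sinfo man page).
SUFFIX_MEANINGS = {
    "~": "power saving",
    "#": "powering up or being configured",
    "%": "powering down",
    "*": "not responding",
}


def describe_state(state: str) -> Tuple[str, List[str]]:
    """Split a state such as 'idle%' into ('idle', ['powering down'])."""
    base = state.rstrip("".join(SUFFIX_MEANINGS))
    flags = [SUFFIX_MEANINGS[ch] for ch in state[len(base):]]
    return base, flags


def is_transitioning(state: str) -> bool:
    """True if the node is powering up, powering down, or not responding."""
    return any(ch in state for ch in "#%*")
```

A node reported as `idle%` or `down*` right after the first job finishes would be consistent with the timing problem described here.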
When Snakemake submits a second job to the same node that just completed a job, there might be timing issues where:
- The node might be in the process of transitioning to POWER_DOWN state
- The node might have already been terminated but not yet fully reset to POWER_SAVING state
- The node might be experiencing health issues that weren't apparent during the first job
Unlike manual submission, where you naturally leave gaps between jobs, Snakemake may submit jobs in rapid succession, leaving too little time for the node state transitions to complete.
To resolve this issue:
- Check the node states using `sinfo` to see if nodes are properly transitioning between states
- Consider increasing the scaledown_idletime parameter to keep nodes available longer
- Add a small delay between job submissions in your Snakemake workflow
- Check the relevant logs (ResumeProgram, SuspendProgram, clustermgtd, slurmctld) for more specific error information
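To illustrate the delay suggestion, here is a rough sketch of pacing submissions from a wrapper script. It is not how Snakemake submits jobs internally. The function names, the 30-second default, and the script names are assumptions chosen for illustration; only the `sbatch` command itself comes from Slurm.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def sbatch(script: str) -> str:
    """Submit one script with sbatch and return its stdout."""
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def submit_paced(
    scripts: Iterable[str],
    submit: Callable[[str], str] = sbatch,
    delay_seconds: float = 30.0,  # assumed value; tune to your cluster
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit scripts one at a time, pausing between consecutive submissions."""
    outputs = []
    for index, script in enumerate(scripts):
        if index > 0:
            # Pause only between submissions, not before the first one.
            sleep(delay_seconds)
        outputs.append(submit(script))
    return outputs
```

Calling `submit_paced(["step1.sh", "step2.sh"])` would submit the first script, wait, then submit the second. The `submit` and `sleep` parameters exist so the pacing logic can be tried out without a real cluster.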
If the issue persists, you might need to investigate whether there are specific node failures occurring by examining the AWS ParallelCluster logs.
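As a starting point for that investigation, the following sketch filters log lines for the timeout message quoted in the error, optionally narrowed to one node name. The helper names are hypothetical, and the log location depends on your cluster; `/var/log/slurmctld.log` on the head node is an assumption you should verify.

```python
import re
from typing import Iterable, List

# The message text reported in the error.
TIMEOUT_PATTERN = re.compile(r"Socket timed out on send/recv operation")


def find_timeout_lines(lines: Iterable[str], node: str = "") -> List[str]:
    """Return lines containing the timeout message (and the node name, if given)."""
    hits = []
    for line in lines:
        if TIMEOUT_PATTERN.search(line) and node in line:
            hits.append(line.rstrip("\n"))
    return hits


def scan_log(path: str, node: str = "") -> List[str]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_timeout_lines(handle, node)
```

For example, `scan_log("/var/log/slurmctld.log", node="<your-node-name>")` would list the timeout entries for a single node, whose timestamps you can then compare against the ResumeProgram and SuspendProgram logs.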
Sources
Slurm guide for multiple queue mode - AWS ParallelCluster
