When Snakemake job submissions fail in AWS ParallelCluster with the error "Code:VolumeLimitExceeded", it means an Amazon EBS volume limit was reached while compute nodes were being configured.
The difference between your manual submission approach and Snakemake automation likely relates to how resources are requested and how quickly nodes are being provisioned. When you submit jobs manually with sbatch, there's typically a natural delay between submissions, allowing the cluster to scale more gradually. Snakemake, even with your 1 submission per second rate limit, might be triggering multiple node launches simultaneously in a pattern that's hitting EBS volume limits for your account.
To troubleshoot this issue:
- Check /var/log/slurmctld.log on the head node to see detailed information about why jobs are failing.
- Examine /var/log/parallelcluster/slurm_resume.log for specific errors related to node provisioning.
- Look at the /var/log/parallelcluster/clustermgtd logs to understand the cluster management daemon's perspective on the issue.
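If you'd rather not page through each log by hand, a small script can pull out just the lines that mention the error. The following is a minimal sketch, assuming the three log paths listed above exist on your head node and that the error string appears verbatim in them; the function name find_matches is only illustrative.

```python
from pathlib import Path

# Log locations named in the troubleshooting steps. Adjust them if your
# head node stores these logs elsewhere (assumption: default locations).
LOG_PATHS = [
    "/var/log/slurmctld.log",
    "/var/log/parallelcluster/slurm_resume.log",
    "/var/log/parallelcluster/clustermgtd",
]


def find_matches(paths, needle="VolumeLimitExceeded"):
    """Return (path, line_number, line) for every log line containing needle."""
    matches = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip logs that do not exist on this machine
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    matches.append((str(path), number, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    for path, number, line in find_matches(LOG_PATHS):
        print(f"{path}:{number}: {line}")
```

Run it on the head node with enough permissions to read the logs. The timestamps on the matching lines should show whether the failures cluster around bursts of node launches.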
Potential solutions include:
- Request an increase to your EBS volume limits from AWS Support
- Modify your Snakemake workflow to use fewer concurrent jobs or to batch jobs differently
- Configure your cluster to use fewer, larger instances rather than many small ones
- Implement a more gradual scaling strategy in your workflow
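As an illustration of the last point, here is a rough sketch of gradual submission: jobs are sent in small batches with a pause between batches so the cluster launches only a few nodes at a time. The helper names, the batch size of 5, and the 60-second pause are assumptions for illustration, not tuned values, and the submit callable is left to you (for example a wrapper around sbatch).

```python
import time
from typing import Callable, List, Sequence


def batched(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def submit_gradually(
    scripts: Sequence[str],
    submit: Callable[[str], object],
    batch_size: int = 5,           # illustrative value, not a recommendation
    pause_seconds: float = 60.0,   # illustrative value, not a recommendation
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Submit scripts batch by batch, pausing between batches.

    Returns the number of scripts passed to `submit`.
    """
    submitted = 0
    batches = batched(scripts, batch_size)
    for index, batch in enumerate(batches):
        for script in batch:
            submit(script)
            submitted += 1
        # Pause after every batch except the last so nodes scale up gradually.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return submitted
```

For example, submit could be a function that runs subprocess.run(["sbatch", script], check=True). If you stay within Snakemake, lowering its --jobs value limits how many jobs it keeps submitted at once, which has a similar effect on how many nodes launch together.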
If you need to investigate a specific failing node, you can:
- Submit a job to trigger a new node launch
- Modify the instance's shutdown behavior to "stop" rather than "terminate" using the EC2 API
- Enable termination protection
- Tag the node for identification
- Detach it from the cluster by changing the cluster name tag
- Retrieve the console output to analyze the failure
This approach will help you preserve the failing node for troubleshooting rather than having it automatically terminated.
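The steps after the node launch can be scripted with boto3. This is a sketch rather than a definitive procedure: it assumes your cluster identifies its nodes by the parallelcluster:cluster-name tag (check the tags on one of your instances to confirm), and the quarantine value and function names are placeholders of my own.

```python
# Assumption: the tag key ParallelCluster uses to associate a node with
# its cluster. Verify it against the tags on your own instances.
CLUSTER_NAME_TAG = "parallelcluster:cluster-name"


def preserve_instance(ec2, instance_id, quarantine_name="QUARANTINED-Compute"):
    """Keep a failing node around for inspection and return its console output.

    `ec2` is a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Stop, rather than terminate, when the instance shuts itself down.
    # EC2 accepts only one attribute per modify_instance_attribute call.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceInitiatedShutdownBehavior={"Value": "stop"},
    )
    # Enable termination protection.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
    # Tag the node so it is easy to find later.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "Name", "Value": quarantine_name}],
    )
    # Detach it from the cluster by overwriting the cluster name tag.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": CLUSTER_NAME_TAG, "Value": quarantine_name}],
    )
    # Retrieve the console output for analysis.
    response = ec2.get_console_output(InstanceId=instance_id)
    return response.get("Output", "")


def main(instance_id):
    """Example entry point; needs boto3 installed and AWS credentials configured."""
    import boto3

    print(preserve_instance(boto3.client("ec2"), instance_id))
```

Call main with the ID of the instance you want to keep. Remember to disable termination protection and terminate the instance yourself once you are done, since the cluster no longer manages it.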
Sources
Troubleshooting scaling issues - AWS ParallelCluster
AWS ParallelCluster troubleshooting - AWS ParallelCluster
