
Snakemake in ParallelCluster fails job submission


Hi!

I created a cluster using ParallelCluster to manage my jobs. I used to submit jobs manually with a bash script that calls sbatch multiple times: jobs run when there are enough resources and wait in the queue (PD) otherwise. Now I am trying to automate my workflow with Snakemake. Even if I set the job limit lower than my EC2 quota and the submission rate very low (1 submission per second), some jobs fail while the nodes are being configured. I checked the Slurm log, and it reports (Code:VolumeLimitExceeded). I understand this might be an EBS-related issue, but I am confused about how it differs from manual submission. Can you help?

Thank you! Paca

asked a month ago · 26 views
1 Answer

The error "Code:VolumeLimitExceeded" that you're seeing when Snakemake submits jobs to AWS ParallelCluster indicates that an Amazon EBS volume limit is being hit while compute nodes are being configured.

The difference between your manual submission approach and Snakemake automation likely relates to how resources are requested and how quickly nodes are being provisioned. When you submit jobs manually with sbatch, there's typically a natural delay between submissions, allowing the cluster to scale more gradually. Snakemake, even with your 1 submission per second rate limit, might be triggering multiple node launches simultaneously in a pattern that's hitting EBS volume limits for your account.
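As a rough check of how close you are to the EBS limit (VolumeLimitExceeded typically means the per-Region storage quota for a volume type has been reached), you can sum the sizes returned by `aws ec2 describe-volumes`. The helper below only parses the JSON; running the AWS CLI itself requires configured credentials, and the helper function is illustrative, not part of any AWS tooling:

```shell
# Sketch: sum provisioned EBS storage (GiB) from `aws ec2 describe-volumes`
# JSON supplied on stdin. The CLI command is real; sum_volume_gib is a
# hypothetical helper for quick inspection.
sum_volume_gib() {
  # pull every  "Size": N  field out of the JSON and add them up
  grep -o '"Size": *[0-9]*' | awk -F': *' '{ s += $2 } END { print s + 0 }'
}

# Usage (on a machine with AWS credentials configured):
# aws ec2 describe-volumes --output json | sum_volume_gib
```

Compare the result against the quota shown in the Service Quotas console for the volume type your cluster uses.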

To troubleshoot this issue:

  1. Check /var/log/slurmctld.log on the head node for detailed information about why jobs are failing.

  2. Examine /var/log/parallelcluster/slurm_resume.log for specific errors related to node provisioning.

  3. Look at the /var/log/parallelcluster/clustermgtd log to understand the cluster management daemon's perspective on the issue.
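The three checks above can be done in one pass on the head node; a minimal sketch that greps each log for the error you saw (missing files are simply skipped):

```shell
# Sketch: scan the ParallelCluster/Slurm logs for the EBS error.
# Paths are the default log locations listed above.
scan_logs() {
  for log in "$@"; do
    [ -f "$log" ] || continue          # skip logs that are not present
    echo "== $log =="
    # show the last 20 matching lines per log
    grep -i -E 'VolumeLimitExceeded|error' "$log" | tail -n 20
  done
}

# On the head node:
# scan_logs /var/log/slurmctld.log \
#           /var/log/parallelcluster/slurm_resume.log \
#           /var/log/parallelcluster/clustermgtd
```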

Potential solutions include:

  • Request an increase to your EBS volume limits from AWS Support
  • Modify your Snakemake workflow to use fewer concurrent jobs or to batch jobs differently
  • Configure your cluster to use fewer, larger instances rather than many small ones
  • Implement a more gradual scaling strategy in your workflow
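The second and fourth points can be expressed directly in a Snakemake cluster profile using its standard throttling options (`jobs`, `max-jobs-per-second`, `max-status-checks-per-second`). The profile path and partition name below are placeholders for your setup:

```yaml
# config.yaml in a Snakemake profile directory
# (path is an assumption, e.g. ~/.config/snakemake/pcluster/config.yaml)
jobs: 20                          # cap on concurrently submitted jobs
max-jobs-per-second: 0.2          # one sbatch roughly every 5 seconds
max-status-checks-per-second: 1
cluster: "sbatch --partition=queue1"   # adjust to your partition
```

Lowering `max-jobs-per-second` below your current 1/second gives the cluster time to finish provisioning (and attaching EBS volumes to) one batch of nodes before the next launch is triggered.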

If you need to investigate a specific failing node, you can:

  1. Submit a job to trigger a new node launch
  2. Modify the instance's shutdown behavior to "stop" rather than "terminate" using the EC2 API
  3. Enable termination protection
  4. Tag the node for identification
  5. Detach it from the cluster by changing the cluster name tag
  6. Retrieve the console output to analyze the failure
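Steps 2 through 6 can be scripted with the AWS CLI. A sketch follows; the instance ID, tag values, and the `parallelcluster:cluster-name` tag key are assumptions to adapt to your cluster, and setting `RUN=echo` lets you preview the commands before running them for real:

```shell
# Sketch: preserve a failing compute node for inspection.
# Set RUN=echo to print the commands instead of executing them.
RUN="${RUN:-}"

preserve_node() {
  instance_id="$1"

  # 2. Stop instead of terminate when the instance shuts itself down
  $RUN aws ec2 modify-instance-attribute \
    --instance-id "$instance_id" \
    --instance-initiated-shutdown-behavior stop

  # 3. Enable termination protection
  $RUN aws ec2 modify-instance-attribute \
    --instance-id "$instance_id" --disable-api-termination

  # 4. Tag it for identification (tag key/value are placeholders)
  $RUN aws ec2 create-tags --resources "$instance_id" \
    --tags Key=debug,Value=failing-node

  # 5. Detach it from the cluster by changing the cluster-name tag
  #    (tag key assumed to be parallelcluster:cluster-name)
  $RUN aws ec2 create-tags --resources "$instance_id" \
    --tags Key=parallelcluster:cluster-name,Value=detached-for-debug

  # 6. Retrieve the console output for analysis
  $RUN aws ec2 get-console-output --instance-id "$instance_id"
}
```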

This approach will help you preserve the failing node for troubleshooting rather than having it automatically terminated.
Sources
Troubleshooting scaling issues - AWS ParallelCluster
AWS ParallelCluster troubleshooting - AWS ParallelCluster

answered a month ago
