Unknown slowdown in ParallelCluster


I've noticed that the time to complete the jobs in my task array varies dramatically. Any idea what is causing it? The first jobs run at an acceptable speed, but then something goes wrong ... ?

I'm using the slurm scheduler 20.11.8 and aws parallelcluster 3.0.2.

Below are 2 examples showing the variation in time/job. I plot the time (in seconds) it takes for each job/task (each job is a dot). (I couldn't see how to attach an image, so I'm providing links.)

example 1: 800 jobs https://ibb.co/KrrwhXn

You can see that the first ~400 tasks complete in roughly 400 seconds per job, and then jobs 400 to 750 take about 6000 seconds.

example 2: 300 jobs: https://ibb.co/4RdTpzg

You can see that the first 50 jobs run slower than jobs 50-150, and then jobs 150-200 are slowest.

In both cases I'm running 50 nodes at a time. It seems like the duration of the job is related to the number of jobs each instance has run. In other words, the speed of the task often changes considerably at each multiple of 50. When I change the number of nodes running at a time, I still observe this pattern. Each job is basically equal in the amount of "work" there is to do (within 5%), so it's not the case, for example, that jobs 150-200 in example 2 are "harder" than the other jobs. Actually the 2 examples above are the exact same jobs (but in example 2 I only ran the first 300 of 800 jobs).

What I've tried:

  1. I've used different instance types, but I observe this slowdown across all of them.
  2. I've used different numbers of nodes, but whether I use 20, 40, or 50, I observe this slowdown.
  3. I've watched CPU and memory usage on both the head node and the compute-fleet nodes, and it seems reasonable. When I monitor with top, the busiest process is generally using less than 1% of memory and 1% of CPU (see the monitoring sketch after this list).
  4. I've explored these logs on the head node, but I haven't found anything that's clearly wrong:
  • /var/log/cfn-init.log
  • /var/log/chef-client.log
  • /var/log/parallelcluster/slurm_resume.log
  • /var/log/parallelcluster/slurm_suspend.log
  • /var/log/parallelcluster/clustermgtd
  • /var/log/slurmctld.log
  5. I've explored these logs on the compute nodes, but I haven't found anything that's clearly wrong:
  • /var/log/cloud-init-output.log
  • /var/log/parallelcluster/computemgtd
  • /var/log/slurmd.log
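
A minimal way to sample CPU state across the fleet while the array runs (a sketch; it assumes passwordless SSH from the head node to the compute nodes, which ParallelCluster sets up for the cluster user):

#!/bin/bash
# Sketch: print the CPU summary line (including steal time, "st") for every
# node currently in queue1. High steal or forced idle during the slow phase
# would point at instance-level throttling rather than at the application.
for node in $(sinfo -N -h -p queue1 -o "%N" | sort -u); do
    echo "=== $node ==="
    ssh -o StrictHostKeyChecking=no "$node" 'top -bn1 | grep "Cpu(s)"'
done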

Here's my configuration file:

Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bucket/head.sh  
  InstanceType: t2.medium
  Networking:
    SubnetId: [snip]
  Ssh:
    KeyName: [snip]
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: t2medium
      InstanceType: t2.medium
      MinCount: 0
      MaxCount: 101
    Networking:
      SubnetIds:
      - subnet-[snip]
    CustomActions:  
      OnNodeConfigured:
        Script: s3://my-bucket/node.sh
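
For reference, this config would be deployed with the ParallelCluster v3 CLI; a sketch, where the cluster name and config file name are placeholders:

# Create the cluster from the config above and check its status.
pcluster create-cluster \
    --cluster-name sim-cluster \
    --cluster-configuration cluster-config.yaml \
    --region us-east-1
pcluster describe-cluster --cluster-name sim-cluster --region us-east-1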

I'm limiting the number of nodes running (50) in the following way:

#!/bin/sh
#SBATCH --partition     queue1
#SBATCH --array=1-800%50                
#SBATCH --nice=100
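
Once the array finishes, per-task wall-clock times can be pulled from Slurm accounting; a sketch, assuming job accounting (slurmdbd) is enabled on the cluster, which it may not be by default:

#!/bin/bash
# Sketch: list elapsed time, start/end, state, and node for each array task.
# Replace <jobid> with the array job ID that sbatch reports; the .batch and
# .extern steps are filtered out to leave one row per task.
sacct -j <jobid> --format=JobID,Elapsed,Start,End,State,NodeList --parsable2 \
    | grep -v -e '\.batch' -e '\.extern'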
  • There is a potential log4j patcher agent issue that can impact performance. Here is the article on that: https://github.com/aws/aws-parallelcluster/wiki/Possible-performance-degradation-due-to-log4j-cve-2021-44228-hotpatch-service-on-Amazon-Linux-2. There are a few optional mitigations in there so that you can pick the one that is right for your environment.

  • Thanks a lot for your idea, @Chris Pollard.

    To verify whether the patcher service is running, I ran this command: sudo systemctl status log4j-cve-2021-44228-hotpatch

    This was the output:

    ● log4j-cve-2021-44228-hotpatch.service
      Loaded: masked (/dev/null; bad)
      Active: inactive (dead)

    I'm not very familiar with systemctl, but it seems like the patcher is not running ("masked" and "inactive")? If so, that might be because I'm not running any Java applications that I know of.

    If this patcher were actually the cause, I would expect the effect to be roughly constant, i.e. slow all the time. But what I observe is that the speed is fine for some period and then there's a sudden spike in run times.

  • Hmmm... if you can share, what exactly is the job doing? From where is it pulling the data that you're processing? Where are the results being stored? I think what Xin Xin means by "disable thread" is to set DisableSimultaneousMultithreading to true (https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading)

  • Hey Chris, sorry for the delay in replying! I'm running proprietary software that solves various simulations. My scripts are stored in $HOME/PROJECT1/scripts/ and output is saved in $HOME/PROJECT1/output/.
    Initially I had the software saved in $HOME. While troubleshooting this issue I wondered whether the problem was the nodes conflicting over access to that shared directory, so now I copy the software from $HOME to /usr when node configuration completes (the OnNodeConfigured bootstrap action); a hypothetical sketch of such a script follows this comment.

    I’ve tried to figure out what “fixes” the issue after it’s occurred. These things do not work:

    • simply stopping the job and restarting it with the same nodes
    • deleting all the output and restarting the job with the same nodes
    • killing all the nodes, deleting all the output, and restarting the job with new nodes

    ... but if I wait for some period of time (not sure how long it must be), then the slowdown goes away.
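
A hypothetical sketch of an OnNodeConfigured script along the lines described above (paths and directory names are illustrative, not the real ones):

#!/bin/bash
# Hypothetical node.sh: copy the simulation software from the NFS-shared home
# directory to node-local storage so compute nodes don't all read it over NFS.
set -euo pipefail
SRC=/home/ec2-user/simulation-software    # shared from the head node
DST=/usr/local/simulation-software        # node-local copy
sudo mkdir -p "$DST"
sudo cp -r "$SRC/." "$DST/"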

asked 2 years ago · 359 views
3 Answers

I guess it's because you're using a T-family instance as the compute node. T-family instances are burstable: they provide baseline CPU performance with the ability to burst above the baseline at any time, for as long as required, by spending CPU credits. If there are no accrued credits remaining, the instance gradually drops back to baseline CPU utilization and cannot burst above it until it accrues more credits. So my guess is that the first 50 jobs use up the credits, and then the CPU has to be throttled for the second 50 jobs while it earns credits back. Moreover, a job is a process in the OS and a CPU core is an independent execution unit, so the number of processes should match the number of CPU cores to avoid process switching and get the best performance. I suggest you re-test this case with a C- or M-family instance and disable thread.
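
One way to check this hypothesis (a sketch; it assumes the AWS CLI is configured on the head node, and the instance ID is a placeholder) is to pull the CPUCreditBalance metric for one of the compute instances from CloudWatch and see whether it hits zero right when the jobs slow down:

#!/bin/bash
# Sketch: CPU credit balance for one T-family compute instance over the last
# six hours, in 5-minute buckets. Replace the instance ID with a real one.
aws cloudwatch get-metric-statistics \
    --region us-east-1 \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average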

Xin Xin
answered 2 years ago
  • Thanks for your thoughts @Xin Xin. I've also observed the same behavior using c5.xlarge and m5.large instances.

    Can you explain what you mean by "disable thread"?

    I've noticed that the CPU usage goes down when the jobs slow down.

    Here's a graph of the CPU activity in a recent test I ran: https://ibb.co/KbXYdcH

    Could there be some sort of account limit that I'm hitting? If so, how could I check for that?


bump! Any ideas for things I can try?

answered 2 years ago

Another possible reason is Slurm scheduler performance: the scheduler's default configuration does not meet the requirements of high-throughput job scheduling, so you may need to adjust the scheduler's config.

You can check the scheduler's performance statistics with the "sdiag" command; see https://slurm.schedmd.com/high_throughput.html for tuning guidance.
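
For example, on the head node (a sketch; sdiag ships with Slurm, and the ten-minute wait is arbitrary):

# Dump scheduler diagnostics, reset the counters, let the array run for a
# while, then dump them again to see cycle times, backfill statistics, and
# the agent queue size over a known window.
sdiag
sudo sdiag --reset
sleep 600
sdiag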

Xin Xin
answered 2 years ago
