I guess it's because you're using a T-family instance as the compute node. T-family instances are burstable: they provide a baseline CPU performance plus the ability to burst above the baseline at any time, spending CPU credits while bursting. When no accrued credits remain, the instance is gradually throttled down to its baseline CPU utilization and cannot burst again until it accrues more credits. So my guess is that your first 50 jobs use up the credits, and the CPU is then throttled during the second 50 jobs while credits are re-earned. Also, each job is an OS process and each CPU core is an independent execution unit, so the number of processes should match the number of CPU cores to avoid process switching and get the best performance. I'd suggest re-testing with a C- or M-family instance and with hyper-threading disabled.
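The credit arithmetic above can be sketched roughly. Per the AWS docs, one CPU credit equals one vCPU running at 100% for one minute; the instance figures below (t3.medium, 2 vCPUs, ~24 credits earned per hour) are example values you should verify against the EC2 burstable-instance documentation for your type.

```python
# Rough sketch: how long can a burstable instance run all vCPUs at 100%
# before its accrued CPU credit balance is exhausted?
# One CPU credit = one vCPU at 100% for one minute (per AWS docs).

def burst_minutes(credit_balance, vcpus, credits_per_hour):
    """Minutes of full-load bursting before the balance hits zero."""
    spend_per_min = vcpus * 1.0             # credits burned per minute at 100%
    earn_per_min = credits_per_hour / 60.0  # credits accrued per minute
    net_drain = spend_per_min - earn_per_min
    if net_drain <= 0:
        return float("inf")  # earning faster than spending: no limit
    return credit_balance / net_drain

# Example figures for a t3.medium with a full 288-credit balance --
# verify the numbers for your instance type in the AWS documentation.
print(round(burst_minutes(288, 2, 24), 1))
```

Once the balance is gone, throughput drops to the baseline, which would look exactly like a batch of jobs suddenly running slower.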
Thanks for your thoughts @Xin Xin. I've also observed the same behavior using c5.xlarge and m5.large instances.
Can you explain what you mean by "disable thread"?
I've noticed that the CPU usage goes down when the jobs slow down.
Here's a graph of the CPU activity in a recent test I ran: https://ibb.co/KbXYdcH
Could there be some sort of account limit that I'm hitting? If so, how could I check for that?
Another possible cause is Slurm scheduler performance: the default scheduler configuration does not meet high-throughput job scheduling requirements, so you may need to adjust the scheduler's configuration.
You can check the scheduler's performance information with the "sdiag" command, and refer to: https://slurm.schedmd.com/high_throughput.html
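As an illustration, the high-throughput guide linked above suggests tuning settings like these in slurm.conf; the values here are placeholders, and the right ones depend on your workload:

```
# Illustrative slurm.conf tuning for high job throughput -- verify each
# setting against the SchedMD high-throughput guide before using it.
SchedulerParameters=defer,max_rpc_cnt=150   # batch scheduling decisions, cap RPC load
MinJobAge=300                               # purge completed job records sooner
SlurmctldDebug=info                         # reduce slurmctld logging overhead
```

After changing these, `sdiag` can show whether scheduler cycle times and backlog actually improved.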
There is a potential log4j patcher agent issue that can impact performance. Here is the article on that: https://github.com/aws/aws-parallelcluster/wiki/Possible-performance-degradation-due-to-log4j-cve-2021-44228-hotpatch-service-on-Amazon-Linux-2. There are a few optional mitigations in there so that you can pick the one that is right for your environment.
Thanks a lot for your idea, @Chris Pollard.
In order to verify if the patcher service is running , I ran this command: sudo systemctl status log4j-cve-2021-44228-hotpatch
this was the output:
● log4j-cve-2021-44228-hotpatch.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
I'm not very familiar with the sudo systemctl status commands, but "masked" and "inactive (dead)" seem to mean the patcher service is disabled and not running. If so, that might be because I'm not running any Java applications that I know of.
If this patcher was actually the cause, I would expect the effect to be somewhat constant. For example, slow all the time. But what I observe is that the speed is fine for some period and then there's a sudden spike in run times.
Hmmm... If you can share: what exactly is the job doing? From where is it pulling the data that you're processing? Where are the results being stored? I think what Xin Xin means by "disable thread" is to set DisableSimultaneousMultithreading to true (https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-DisableSimultaneousMultithreading)
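In a ParallelCluster v3 cluster config that setting sits under the queue's compute resource; a minimal sketch (queue and resource names here are placeholders, not from your cluster):

```
# Illustrative ParallelCluster v3 fragment: disable hyper-threading so
# Slurm schedules one job slot per physical core.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5xlarge
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
```

With hyper-threading on, two jobs can land on sibling threads of the same physical core and slow each other down, which is one way a batch of jobs can suddenly run at roughly half speed.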
Hey Chris, sorry for the delay in replying! I’m running proprietary software that solves various simulations. I store my scripts in $HOME/PROJECT1/scripts/ and output is saved in $HOME/PROJECT1/output/
Initially I had the software saved in $HOME. While troubleshooting this issue I wondered if the problem was the nodes having some conflict accessing this shared directory, so now I copy the software from $HOME to /usr when node configuration completes, using an OnNodeConfigured bootstrap action.
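For reference, that kind of bootstrap action can be declared per queue in the ParallelCluster v3 config; the bucket and script names below are placeholders for whatever copy script is actually used:

```
# Illustrative queue-level custom action: run a script after each
# compute node finishes configuring, e.g. to copy software from the
# shared $HOME export onto node-local storage.
SlurmQueues:
  - Name: compute
    CustomActions:
      OnNodeConfigured:
        Script: s3://my-bucket/copy-software.sh
```

Moving the binaries off the shared filesystem rules out NFS contention on the software itself, though job input/output under $HOME would still go over the shared export.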
I’ve tried to figure out what “fixes” the issue after it’s occurred. These things do not work:
... but if I wait for some period of time (not sure how long it must be), then the slowdown goes away.