Incomplete final job output in AWS ParallelCluster (Slurm-based)


Hello Community,

We are facing an issue in AWS ParallelCluster (Slurm scheduler) where the output of a job remains incomplete after the job completes. This happens only for long-running jobs (5-6+ hours). For example, the final output file should have 10 rows after the job completes, but it shows only 5 rows; the remaining 5 rows are missing from the final output file.

It seems like either a network issue (exceeding network limits) between the head node and compute nodes that drops after some time, or communication between the nodes and storage that is lost after a while (2-3 hours). As a result, the remaining rows/data are never written back to the final output file on shared storage (FSx for Lustre in our setup), which leaves the job output incomplete. We have also tested this with both static and dynamic node setups; there is no difference and the issue persists in both cases.

We have done some further testing to isolate the issue:

  1. When we executed the job directly on the head node as a standalone compute machine, i.e. without submitting it through ParallelCluster and the Slurm scheduler, we could still reproduce the issue. So the issue does not seem to be with ParallelCluster or the Slurm scheduler.
  2. When we executed the job on the head node again, but this time stored the data/code files on the head node's local storage (root volume), where the job creates/writes its temp and final output files, instead of on the shared storage (FSx for Lustre), it WORKS FINE and we do not see the issue. So the issue seems to be with how FSx is mounted on the head/compute nodes, or some kind of network issue between the FSx service and our head/compute nodes during job execution, because of which the job completes but the final output remains incomplete. (A basic mount check for the nodes is sketched after this list.)
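As a first check for the mount theory, a minimal sketch like the following can confirm that the Lustre mount is present and writable on a given head/compute node. The /fsx mount point is an assumption; substitute the shared storage path from your ParallelCluster configuration.

```python
# Minimal sketch: verify the FSx for Lustre mount is present and writable on a node.
# The mount point below is hypothetical -- replace it with the path configured
# under SharedStorage in your ParallelCluster setup.
import os
import tempfile

MOUNT_POINT = "/fsx"  # hypothetical mount point

def lustre_mount_present(mount_point: str) -> bool:
    """Return True if a filesystem of type 'lustre' is mounted at mount_point."""
    with open("/proc/mounts") as f:
        for line in f:
            _device, mnt, fstype = line.split()[:3]
            if mnt == mount_point and fstype == "lustre":
                return True
    return False

def mount_is_writable(mount_point: str) -> bool:
    """Try to create, write, and fsync a small temporary file on the mount."""
    try:
        with tempfile.NamedTemporaryFile(dir=mount_point) as tmp:
            tmp.write(b"probe")
            tmp.flush()
            os.fsync(tmp.fileno())
        return True
    except OSError:
        return False

if __name__ == "__main__":
    print("lustre mounted:", lustre_mount_present(MOUNT_POINT))
    print("mount writable:", mount_is_writable(MOUNT_POINT))
```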

Could it be related to I/O (throughput) limits on FSx which, if exceeded, prevent writes to the output file at that particular time, resulting in an incomplete output file?
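One way to test the throughput-limit theory is to compare the write throughput reported by CloudWatch against the file system's provisioned throughput. Below is a rough boto3 sketch under some assumptions: the file system ID is a placeholder, and a persistent deployment type is assumed (PerUnitStorageThroughput is only defined for persistent file systems).

```python
# Rough sketch: compare observed FSx for Lustre write throughput against the
# provisioned limit. File system ID is a placeholder; a persistent deployment
# type is assumed so that PerUnitStorageThroughput is present.
import datetime
import boto3

FILE_SYSTEM_ID = "fs-0123456789abcdef0"  # placeholder

fsx = boto3.client("fsx")
cw = boto3.client("cloudwatch")

fs = fsx.describe_file_systems(FileSystemIds=[FILE_SYSTEM_ID])["FileSystems"][0]
capacity_tib = fs["StorageCapacity"] / 1024                        # StorageCapacity is in GiB
per_unit = fs["LustreConfiguration"]["PerUnitStorageThroughput"]   # MB/s per TiB
provisioned_mbps = capacity_tib * per_unit
print(f"Provisioned throughput: {provisioned_mbps:.0f} MB/s")

# Observed write throughput over the last 6 hours, in 1-minute windows.
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)
resp = cw.get_metric_statistics(
    Namespace="AWS/FSx",
    MetricName="DataWriteBytes",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    observed_mbps = point["Sum"] / 60 / 1_000_000                  # bytes per minute -> MB/s
    if observed_mbps > 0.8 * provisioned_mbps:
        print(f"{point['Timestamp']}: ~{observed_mbps:.0f} MB/s (near the provisioned limit)")
```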

Other configuration around FSx: we have two DRAs configured with FSx for two different S3 buckets; both DRAs have import and export policies enabled for New, Changed, and Deleted events. We also have a DataSync scheduled task (running every 1 hour) to sync data from FSx to the S3 buckets. The reason for the DataSync job is that the AgeOfOldestQueuedMessage metric sometimes gets high (>0) during job execution, and hence the auto-export policy does not export all data from FSx to S3.
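For monitoring the export backlog described above, a hedged sketch along these lines can show when AgeOfOldestQueuedMessage rises above zero during a job. The file system ID is a placeholder, and the metric's dimension set is discovered via list_metrics rather than assumed.

```python
# Hedged sketch: watch the AgeOfOldestQueuedMessage metric for signs that
# auto-export is falling behind during a job. The file system ID is a placeholder.
import datetime
import boto3

FILE_SYSTEM_ID = "fs-0123456789abcdef0"  # placeholder
cw = boto3.client("cloudwatch")

# Discover the exact dimension sets this metric is published with for our file system,
# instead of hard-coding them.
metrics = cw.list_metrics(Namespace="AWS/FSx", MetricName="AgeOfOldestQueuedMessage")["Metrics"]
metrics = [
    m for m in metrics
    if {"Name": "FileSystemId", "Value": FILE_SYSTEM_ID} in m["Dimensions"]
]

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)
for m in metrics:
    resp = cw.get_metric_statistics(
        Namespace="AWS/FSx",
        MetricName="AgeOfOldestQueuedMessage",
        Dimensions=m["Dimensions"],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Maximum"],
    )
    for p in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        if p["Maximum"] > 0:
            print(m["Dimensions"], p["Timestamp"], f"{p['Maximum']:.0f}s backlog")
```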

Please let us know what the possible causes could be here, whether related to how FSx is mounted or anything else.

Thanks,

Gaurav
asked 10 months ago · 265 views
2 Answers
Accepted Answer

The issue was fixed after increasing the FSx throughput to a higher value.
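For anyone finding this later, a hedged boto3 sketch of the kind of change described is shown below. The file system ID and target value are placeholders; PerUnitStorageThroughput applies to persistent deployment types, and the valid values depend on the deployment type (e.g. 125/250/500/1000 MB/s/TiB for PERSISTENT_2).

```python
# Hedged sketch: raise the per-unit storage throughput of a persistent FSx for
# Lustre file system (the fix described above). Values are placeholders.
import boto3

fsx = boto3.client("fsx")
fsx.update_file_system(
    FileSystemId="fs-0123456789abcdef0",                # placeholder
    LustreConfiguration={"PerUnitStorageThroughput": 250},  # example target value
)
```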

Gaurav
answered 10 months ago
AWS
answered 10 months ago
  • Hello,

    Thanks for the information. We have found the issue: it was the FSx throughput limit. After increasing it further, the issue is fixed.

    Thanks,
