How does Slurm sync files between compute and master nodes?


I've set up a high-performance computing cluster on AWS similar to the one described in this blog post: https://aws.amazon.com/blogs/compute/running-ansys-fluent-on-amazon-ec2-c5n-with-elastic-fabric-adapter-efa/. The resulting cluster has one master node that spins up one compute node.

Consider the following file (saved as test_slurm.sh):

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --output=res.txt

#SBATCH --ntasks=1
#SBATCH --time=10:00

ip a > file.txt

When I run sbatch test_slurm.sh from the master node, a new file.txt appears in the same directory, containing IP information that matches the compute node. If I ssh into the compute node, the file is available there as well.

It seems to me that the compute node executes the content of test_slurm.sh, saves a file in its file system and somehow syncs that with the master node. What mechanism is responsible for the file sync? Are the files synced in this manner encrypted in transit?

asked 3 years ago, 1172 views
5 Answers

Accepted answer

Hi ProlucidDavid,

I assume you are working in the default directory /home/<cluster_user>.
If so, we always share /home from the head node to all compute nodes via NFS, so you are accessing the same /home/<cluster_user>/file.txt from both the head node and the compute node. Nothing is copied or synced between them.

We also share a number of other directories via NFS from the head node to the compute nodes, depending on your cluster configuration. You can see which directories are shared by checking /etc/exports on the head node.
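
As a small sketch of that check, the helper below prints only the exported directories from an exports-format file. The function name list_exports is made up for illustration. It assumes the usual exports layout (one export per line, directory in the first field, # starting a comment) and does not handle line continuations or quoted paths.

```shell
#!/bin/bash

# Hypothetical helper: list the directories named in an exports-format file.
# Defaults to /etc/exports when no file is given.
list_exports() {
    local exports_file="${1:-/etc/exports}"
    # Skip blank lines and comment lines, then print the first field,
    # which is the exported directory.
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$exports_file"
}
```

On the head node, calling list_exports with no argument reads /etc/exports. Running sudo exportfs -v there shows what the NFS server is currently exporting, along with the export options.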

If you are looking for other filesystem options, we also support other types of shared filesystem such as EFS and FSx for Lustre.

Hope that helps! Please let us know if you have any additional questions.

Edited by: AWS-Rex on Dec 14, 2020 3:28 PM

Rex-aws
answered 3 years ago

I asked a similar question on Stack Overflow. A copy of the response is included below, along with a link to the question. The poster indicates that Slurm makes no effort to transfer files and speculates that pcluster has been configured to allow for file exchange. Is this the case? If so, what is that mechanism? Is it encrypted?

Link to the Stack Overflow question: https://stackoverflow.com/questions/65225099/what-mechanism-does-slurm-use-to-sync-files-between-compute-nodes-and-the-master. Copy of the answer:

Slurm will assume that there is a common, shared, filesystem available on all compute nodes and will take that as a prerequisite. Often, clusters will have a "home" filesystem, using technologies such as NFS, GPFS, Lustre, GlusterFS, BeeGFS, AndrewFS, etc, along with other filesystems with different performances/reliability tradeoffs.

But Slurm will not make any effort to transfer files to/from compute nodes, except for the submission script.
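
One way to see this from inside a job is to have the job report where it ran and what kind of filesystem its working directory lives on. The script below is only a sketch: the job name, output file name, and the describe_workdir function are made up for illustration, and it assumes a df that accepts -P and -T (GNU coreutils does).

```shell
#!/bin/bash

#SBATCH --job-name=where
#SBATCH --output=where.txt
#SBATCH --ntasks=1
#SBATCH --time=1:00

# Hypothetical helper: report the node name, the working directory,
# and the filesystem type that the working directory is stored on.
describe_workdir() {
    echo "host=$(uname -n)"
    echo "dir=$PWD"
    # With -P each filesystem is printed on one line; with -T the
    # second column of that line is the filesystem type.
    echo "fstype=$(df -PT . | awk 'NR==2 { print $2 }')"
}

describe_workdir
```

If where.txt shows the compute node's name together with a network filesystem type such as nfs4, the file was written once, over the network, to storage that the head node also sees.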

In your case, this is most probably set up by the procedure you used to spin up the virtual cluster. Indeed, in the blog post you refer to, the configuration file has a line fsx_settings = parallel-fs that seems to indicate there is a parallel filesystem set up. It is further configured in the [fsx parallel-fs] section. From reading the AWS documentation, it could be a Lustre filesystem.

As for encryption, it probably isn't as this type of filesystem is designed for performance on private networks, not for security on WANs. The Amazon procedure most probably configures a private network for the compute nodes.

Edited by: ProlucidDavid on Dec 10, 2020 6:32 AM

answered 3 years ago

Hi AWS-Rex,

Thank you for the clarification. I've confirmed that certain directories are being shared by looking at /etc/exports, and that an NFS service is running. This explains why the files appear on both nodes even though Slurm takes no responsibility for that.
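
The same confirmation can be made from the client side. As a rough sketch, the helper below lists NFS mounts from a file in /proc/mounts format; the name nfs_mounts is made up for illustration, and the field layout assumed is the standard Linux one (device, mount point, type, options).

```shell
#!/bin/bash

# Hypothetical helper: list NFS client mounts from a /proc/mounts-style file.
# Prints: mount point, filesystem type, remote source, mount options.
nfs_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    # Match only the client types nfs and nfs4, not the server-side
    # nfsd pseudo-filesystem.
    awk '$3 ~ /^nfs4?$/ { print $2, $3, $1, $4 }' "$mounts_file"
}
```

Run on a compute node with no argument, this reads /proc/mounts. The options column includes the sec= flavor on systems that report it; sec=sys means plain NFS with no encryption on the wire.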

Going forward, I need to ensure that all communication between the head node and the compute node is encrypted. With that in mind, do you have any insight on the following questions?
- It looks like there are several ways to add security to NFS, but I can't find any discussion of the NFS setup in pcluster. Do you know if it is set up with some form of security? Is there a link to documentation on this configuration?
- As you've mentioned, pcluster supports EFS and FSx. If I were to add a config section for one of those file systems in the global pcluster config, would it be used instead of NFS? Or would it be used in parallel, with NFS managing the user accounts and the other file system used for the shared_dir specified in the pcluster config? https://docs.aws.amazon.com/parallelcluster/latest/ug/global.html

Thanks for your help!

answered 3 years ago

ProlucidDavid,

Regarding the security with which we configure NFS: in general system defaults are used, and no additional configuration for the sake of added security is performed. This isn't currently documented. That said, I'll create this as a feature request.

Regarding the use of EFS or FSx for more security: using these file systems would not prevent NFS from being used to share the home and Slurm configuration file directories. They would be used in parallel with NFS.

Regarding encryption of EFS data in transit: according to https://docs.aws.amazon.com/efs/latest/ug/encryption.html#encryption-in-transit , "You can enable encryption of data at rest when creating an Amazon EFS file system. You can enable encryption of data in transit when you mount the file system." When ParallelCluster creates the EFS file system and the encrypted configuration directive is set to true, data is encrypted at rest. Unfortunately I don't think we currently support encryption of data in transit.
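
For reference, the mount-time option that the EFS documentation refers to is the tls option of the EFS mount helper from amazon-efs-utils. The sketch below only prints the command rather than running it, since mounting needs root and a real file system. The function name, the file system ID, and the mount point are placeholders, and this is a manual mount outside anything ParallelCluster configures.

```shell
#!/bin/bash

# Hypothetical helper: print (not run) the command that mounts an EFS
# file system with encryption in transit enabled.
efs_tls_mount_cmd() {
    local fs_id="$1" mount_point="$2"
    # -t efs selects the mount helper shipped with amazon-efs-utils;
    # -o tls asks the helper to tunnel the NFS traffic over TLS.
    printf 'sudo mount -t efs -o tls %s:/ %s\n' "$fs_id" "$mount_point"
}

# Example with placeholder values:
efs_tls_mount_cmd fs-12345678 /mnt/efs
```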

~Tim

answered 3 years ago
