
Questions tagged with AWS ParallelCluster



Cluster created with ParallelCluster will not run jobs

UPDATE: I answered this question for myself. I re-created the AMI, but manually this time (following these docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.update-cluster-v3.html#modify-an-aws-parallelcluster-ami), and it worked. Odd, because the documentation cautions against this, yet it worked better than creating the AMI with pcluster. I can't delete the question, so here it is for the record.

I created a Slurm cluster using AWS ParallelCluster (the `pcluster` tool). Creation works fine and I can ssh to the head node, but when I submit jobs they do not run.

Using `srun`:

```
$ srun echo hello world
srun: error: Node failure on queue1-dy-t2micro-1
srun: Force Terminated job 1
srun: error: Job allocation 1 has been revoked
```

Using `sbatch`:

```
$ sbatch t.sh
Submitted batch job 2
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
      2    queue1  t.sh ubuntu CF  0:02     1 queue1-dy-t2micro-2
```

Above it looks like it is going to start the job on host `queue1-dy-t2micro-2`, but that host never comes up, or at least does not stay up, and after a little while I see this:

```
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
      2    queue1  t.sh ubuntu PD  0:00     1 (BeginTime)
```

After that, the job never runs. Does anyone know what is going on? I did use a custom AMI, which I also built with pcluster, but I am not sure that is the issue, because the head node comes up just fine and it is using the same AMI.
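For anyone hitting the same symptom, here is a minimal diagnostic sketch, assuming the standard ParallelCluster 3 log locations on the head node and the compute resource name shown above. The reason a dynamic node is marked down usually shows up in clustermgtd or slurm_resume.log.

```
# Run on the head node. Standard ParallelCluster 3 log locations; adjust if yours differ.

# Why clustermgtd marked the dynamic node unhealthy/down:
sudo grep -iE 'error|fail' /var/log/parallelcluster/clustermgtd | tail -n 50

# Whether the EC2 launch for the compute node failed (bad custom AMI, bootstrap
# timeout, capacity or quota problems all surface here):
sudo grep -iE 'error|fail' /var/log/parallelcluster/slurm_resume.log | tail -n 50

# Slurm's own record of why the node was set down:
scontrol show node queue1-dy-t2micro-1 | grep -i reason
```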
0 answers · 0 votes · 4 views · asked 2 days ago

Job on parallel cluster using hpc6a.48xlarge not running

Hello, I want to run a simulation on a parallel cluster (alinux2) using 2 hpc6a.48xlarge instances (192 CPUs). I created the cluster and submitted the job using Slurm. The problem is that the job stays waiting in the queue and never runs (I left it for more than a day). I tried running the same job using another kind of instance with the same number of CPUs and it worked perfectly, so it seems to be an issue with this specific instance type (hpc6a.48xlarge). I also tried using only 1 hpc6a.48xlarge instance (96 CPUs), but that did not work either.

I copy the `squeue` output at the end of the message. It shows some 'BeginTime' reasons, although I have not scheduled my job to start later. What may be the reason for this issue?

I am creating the cluster on a new company account. Could the issue be related to the usage of the account? I ask because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job executes almost immediately. I would appreciate any advice on resolving this issue.

```
[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD  0:00     1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD  0:00     1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD  0:00     1 (BeginTime)
```

(The output above repeats across many invocations, with the reason alternating between `(BeginTime)` and `(None)` while the job stays pending.)
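Not an answer, but a sketch of a few checks that usually narrow this down on a new account. It assumes the default ParallelCluster 3 log location on the head node and an AWS CLI configured for the cluster's region; the quota name is matched loosely because the exact quota code is not verified here.

```
# On the head node: did the EC2 launch for the hpc6a nodes fail, and why?
# Capacity and quota errors from launch attempts land in slurm_resume.log.
sudo grep -iE 'error|insufficient|exceeded' /var/log/parallelcluster/slurm_resume.log | tail -n 50

# HPC instances have their own On-Demand vCPU quota, which may be very low or 0
# on a new account, independent of the standard On-Demand quota:
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'HPC')].[QuotaName,Value]" \
  --output table

# hpc6a.48xlarge is only offered in specific Availability Zones, so confirm the
# compute subnet's AZ actually offers it (region shown is a placeholder):
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=hpc6a.48xlarge \
  --region us-east-2
```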
4 answers · 0 votes · 5 views · asked 2 months ago

Does anyone have an OpenZFS filesystem working with AWS ParallelCluster?

Q: Has anyone created an OpenZFS volume in FSx and been able to mount it using the pcluster3 CLI in AWS ParallelCluster? It doesn't seem to work. I created an OpenZFS filesystem, but I can't get it to mount from the pcluster CLI. I do not have this issue with other filesystems. The YAML descriptor for the filesystem mount looks like this:

```
SharedStorage:
  - Name: modelingtest
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      FileSystemId: fs-045ffe08c17984010
```

It seems the CLI spits out a non-descriptive error before the cluster build even starts. This looks like a CLI bug to me.

```
pcluster create-cluster --cluster-configuration ./bluefishtestfsx.yaml --cluster-name testami --region us-east-1
{
  "message": "'NoneType' object has no attribute 'get'"
}
```

The documentation for this is interesting too. It says you need to specify the storage size unless `FileSystemId` is specified: https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html

If I add the filesystem size, I get a different error, which makes sense:

```
pcluster create-cluster --cluster-configuration ./bluefishtestfsx.yaml --cluster-name testami --region us-east-1
{
  "configurationValidationErrors": [
    {
      "level": "ERROR",
      "type": "ConfigSchemaValidator",
      "message": "[('SharedStorage', {0: {'FsxLustreSettings': {'_schema': ['storage_capacity is ignored when an existing Lustre file system is specified.']}}})]"
    }
  ],
  "message": "Invalid cluster configuration."
}
```
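For reference, a hedged sketch of what an FSx for OpenZFS mount looks like in the ParallelCluster 3 schema, assuming ParallelCluster 3.2 or later (which added the `FsxOpenZfs` storage type); the volume ID below is a placeholder. Note that it takes the file system's volume ID rather than the file system ID, which is also why pointing an `FsxLustre`/`FileSystemId` block at an OpenZFS resource does not validate.

```
SharedStorage:
  - Name: modelingtest
    MountDir: /fsx
    StorageType: FsxOpenZfs
    FsxOpenZfsSettings:
      VolumeId: fsvol-0123456789abcdef0   # root volume of the existing OpenZFS file system (placeholder ID)
```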
1 answer · 0 votes · 3 views · asked 2 months ago

Creating custom YAML files for AWS Parallel Cluster

I am trying to follow the tutorial for running FDS/SMV on AWS ParallelCluster here: https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/. I get to the point where it asks me to set up a config file with the following data:

```
[aws]
aws_region_name = <AWS-REGION>

[global]
sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
compute_instance_type = c5n.18xlarge
master_instance_type = c5.xlarge
initial_queue_size = 0
max_queue_size = 100
scheduler = slurm
cluster_type = ondemand
s3_read_write_resource = arn:aws:s3:::fds-smv-bucket-unique*
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
```

I am unable to create a YAML file that will be accepted by ParallelCluster's create-cluster. It returns the error:

```
{
  "message": "Bad Request: Configuration must be a valid YAML document"
}
```

I attempted to create a YAML file using the AWS ParallelCluster configure wizard (https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-configuring.html), but it does not cover all the specifications the tutorial asks for, like a shared S3 bucket in FSx. I tried starting with the wizard-created configuration file and editing it so it looks like the YAML files in the documentation, but it still returns the same error. See my edited YAML file here:

```
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-032f3e6409362aff2
  Ssh:
    KeyName: MyKeyPair1
  DisableSimultaneousMultithreading: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: c5n18xlarge
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 100
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-032f3e6409362aff2
      Iam:
        S3Access:
          - BucketName: fds-smv-jts-bucket-1
            EnableWriteAccess: True
        AdditionalIamPolicies:
          - Policy: arn:aws:s3:::fds-smv-jts-bucket-1*
SharedStorage:
  - MountDir: /fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportedFileChunkSize: 1024
      ExportPath: s3://fds-smv-jts-bucket-1
      ImportPath: s3://fds-smv-jts-bucket-1
Tags:
  - Key: String
    Value: fds-smv
DevSettings
  ClusterTemplate: fds-smv-cluster
```

Any ideas on how to create a proper YAML file with all the data the tutorial requests? Thank you!
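For reference, here is a hedged sketch of how that v2 config maps onto the v3 YAML schema, reusing the subnet, key, and bucket names from the question; treat it as a starting point rather than a verified configuration. The main differences from the YAML above: every `SharedStorage` entry needs a `Name`, `AdditionalIamPolicies` expects IAM policy ARNs rather than S3 ARNs (the `S3Access` block already covers the bucket), tags need a real key name, and v2's `cluster_template` has no v3 equivalent, so the trailing `DevSettings`/`ClusterTemplate` lines can simply be dropped (they are a likely culprit for the "must be a valid YAML document" error).

```
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-032f3e6409362aff2
  Ssh:
    KeyName: MyKeyPair1
  Dcv:
    Enabled: true                               # v2 "dcv_settings"
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: c5n18xlarge
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 100
          DisableSimultaneousMultithreading: true   # v2 "disable_hyperthreading"
          Efa:
            Enabled: true                       # v2 "enable_efa = compute"
      Networking:
        SubnetIds:
          - subnet-032f3e6409362aff2
        PlacementGroup:
          Enabled: true                         # v2 "placement_group = DYNAMIC"
      Iam:
        S3Access:                               # v2 "s3_read_write_resource"
          - BucketName: fds-smv-jts-bucket-1
            EnableWriteAccess: true
SharedStorage:
  - Name: fsxshared                             # a Name is required for each entry
    MountDir: /fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportedFileChunkSize: 1024
      ImportPath: s3://fds-smv-jts-bucket-1
      ExportPath: s3://fds-smv-jts-bucket-1
Tags:
  - Key: Name
    Value: fds-smv
```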
1 answer · 0 votes · 31 views · asked 4 months ago

Enter passphrase for key '/home/ec2-user/.ssh/lab-3-key' - Cloud9, pcluster, FDS

I am attempting to run FDS (a CFD code for fire simulation) on a Cloud9 pcluster setup using the following tutorial: https://fds-smv-on-pcluster.workshop.aws/oyo/setup/pcluster.html. A few of the input commands in the tutorial are out of date, but I have gotten to the point where you connect to the cluster with your created SSH lab key to download and install the FDS software:

```
pcluster ssh --cluster-name pc-fsx -i ~/.ssh/lab-3-key
```

And I get asked for a passphrase:

```
Enter passphrase for key '/home/ec2-user/.ssh/lab-3-key':
```

I do not know what this passphrase is. Is it created when I create my lab key? Is it from my AWS account or Cloud9 environment? I am relatively new to AWS, Python, and the command line. FYI, the key is created as follows:

```
IFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/)
SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${IFACE}/subnet-id)
VPC_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${IFACE}/vpc-id)
REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')
AWS_REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')
```

```
# generate a new key-pair
aws ec2 create-key-pair --key-name lab-3-your-key --query KeyMaterial --output text --region=${AWS_REGION} > ~/.ssh/lab-3-key
chmod 600 ~/.ssh/lab-3-key
```

I believe I did everything the tutorial asked, as far as I can tell. Any help?
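Not a definitive answer, but a quick sanity check worth running (paths and key name as in the tutorial): a key pair created with `aws ec2 create-key-pair` has no passphrase, so an unexpected passphrase prompt usually means the file on disk does not contain a valid private key, for example because the create call failed and wrote an error message or nothing at all.

```
# The file should start with a PEM header and be roughly 1.5-2 KB:
head -n 1 ~/.ssh/lab-3-key     # expect something like "-----BEGIN RSA PRIVATE KEY-----"
wc -c ~/.ssh/lab-3-key

# If it looks wrong, delete the key pair, recreate it, and retry pcluster ssh:
aws ec2 delete-key-pair --key-name lab-3-your-key --region "${AWS_REGION}"
aws ec2 create-key-pair --key-name lab-3-your-key --query KeyMaterial --output text \
  --region "${AWS_REGION}" > ~/.ssh/lab-3-key
chmod 600 ~/.ssh/lab-3-key
```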
1 answer · 0 votes · 5 views · asked 4 months ago

Unknown slowdown in ParallelCluster

I've noticed that the amount of time to complete the jobs in my task array varies dramatically. Any idea what is causing it? The speed of the jobs seems very acceptable for the first jobs, but then something goes wrong. I'm using the Slurm scheduler 20.11.8 and AWS ParallelCluster 3.0.2.

Below are two examples showing the variation in time per job. I plot the time (in seconds) it takes for each job/task (each job is a dot). (I couldn't see how to attach an image, so I'm providing links.)

Example 1, 800 jobs: https://ibb.co/KrrwhXn. You can see that the first ~400 tasks complete in roughly 400 seconds per job, and then jobs 400 to 750 take about 6000 seconds.

Example 2, 300 jobs: https://ibb.co/4RdTpzg. You can see that the first 50 jobs run slower than jobs 50-150, and then jobs 150-200 are slowest.

In both cases I'm running 50 nodes at a time. It seems like the duration of a job is related to the number of jobs each instance has already run: the speed of the tasks often changes considerably at each multiple of 50, and when I change the number of nodes running at a time, I still observe this pattern. Each job involves essentially the same amount of "work" (within 5%), so it's *not* the case, for example, that jobs 150-200 in example 2 are "harder" than the other jobs. In fact, the two examples above are the exact same jobs (in example 2 I only ran the first 300 of the 800).

What I've tried:

1. Different instance types: I observe this slowdown across all instance types.
2. Different numbers of nodes: whether I use 20, 40, or 50, I observe this slowdown.
3. Watching CPU and memory usage on both the head node and the compute fleet: it seems reasonable. When I monitor with `top`, the highest-usage process is generally using less than 1% of memory and 1% of CPU.
4. Checking these logs on the **head** node, where I haven't found anything that's clearly wrong:
   * /var/log/cfn-init.log
   * /var/log/chef-client.log
   * /var/log/parallelcluster/slurm_resume.log
   * /var/log/parallelcluster/slurm_suspend.log
   * /var/log/parallelcluster/clustermgtd
   * /var/log/slurmctld.log
5. Checking these logs on the **compute** node, where I also haven't found anything that's clearly wrong:
   * /var/log/cloud-init-output.log
   * /var/log/parallelcluster/computemgtd
   * /var/log/slurmd.log

Here's my configuration file:

```
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bucket/head.sh
  InstanceType: t2.medium
  Networking:
    SubnetId: [snip]
  Ssh:
    KeyName: [snip]
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: t2medium
          InstanceType: t2.medium
          MinCount: 0
          MaxCount: 101
      Networking:
        SubnetIds:
          - subnet-[snip]
      CustomActions:
        OnNodeConfigured:
          Script: s3://my-bucket/node.sh
```

I'm limiting the number of nodes running at a time (50) in the following way:

```
#!/bin/sh
#SBATCH --partition queue1
#SBATCH --array=1-800%50
#SBATCH --nice=100
```
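A sketch of one way to narrow this down (the instance ID, job ID, time window, and region below are placeholders, and `sacct` only works if Slurm accounting is enabled on the cluster): since the compute fleet uses burstable t2.medium instances, it is worth ruling out CPU-credit exhaustion, which slows an instance down only after it has been busy for a while, and checking whether the slow tasks cluster on particular nodes.

```
# CPU credit balance over the run window for one compute instance (in the default
# "standard" credit mode, a t2 instance throttles to baseline once this hits zero):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2022-01-01T00:00:00Z --end-time 2022-01-01T06:00:00Z \
  --period 300 --statistics Average \
  --region us-east-1

# If Slurm accounting is enabled, per-task elapsed times grouped by node show whether
# the slow tasks pile up on specific instances (123 is a placeholder array job ID):
sacct -j 123 --format=JobID,Elapsed,NodeList -P | sort -t'|' -k3,3
```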
3 answers · 0 votes · 7 views · asked 5 months ago