Questions tagged with High Performance Compute


"pcluster create" results in an error during startup - how to address it?

**I am attempting to create a small ParallelCluster cluster on OSX using Anaconda. I installed and configured ParallelCluster with no errors. The resulting simple config file looks like this:**

```
[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = newmac
base_os = ubuntu1604
scheduler = sge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-058cb4b123bf54848
master_subnet_id = subnet-04998fb2f4bc80ccc
compute_subnet_id = subnet-03c7e26fb2ea81430
use_public_ips = false
```

**When I tried to create the cluster, here is what I saw:**

    (base) esteban$ pcluster create -c /Users/esteban/.parallelcluster/config first-cluster
    Beginning cluster creation for cluster: first-cluster
    WARNING: The configuration parameter 'scheduler' generated the following warnings:
    The job scheduler you are using (sge) is scheduled to be deprecated in future releases of ParallelCluster. More information is available here: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster
    Creating stack named: parallelcluster-first-cluster
    Status: parallelcluster-first-cluster - ROLLBACK_IN_PROGRESS
    Cluster creation failed.
    Failed events:
      - AWS::AutoScaling::AutoScalingGroup ComputeFleet Resource creation cancelled
      - AWS::CloudFormation::WaitCondition MasterServerWaitCondition Received FAILURE signal with UniqueId i-0438bb35816b58aa5

**I can still log in, although no SGE binaries are installed.**

    $ pcluster ssh first-cluster -i ~/.ssh/newmac.pem

**So I deleted the cluster, which appears to have (I hope) cleaned up all the associated services.**

    pcluster delete -c /Users/esteban/.parallelcluster/config first-cluster

Any ideas? Thanks

Edited by: Stevie on Jul 23, 2020 11:42 AM
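As a first sanity check on a config like the one in this question, it can help to see which `[cluster ...]` and `[vpc ...]` sections the `cluster_template` and `vpc_settings` keys actually resolve to. The following is a minimal sketch using only Python's standard `configparser`; the function name `resolve_cluster` is hypothetical, and the script only reads the text, it does not call `pcluster` or AWS.

```python
import configparser

# Config text copied from the question above.
CONFIG_TEXT = """
[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = newmac
base_os = ubuntu1604
scheduler = sge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-058cb4b123bf54848
master_subnet_id = subnet-04998fb2f4bc80ccc
compute_subnet_id = subnet-03c7e26fb2ea81430
use_public_ips = false
"""


def resolve_cluster(text):
    """Hypothetical helper: follow cluster_template and vpc_settings
    to the sections they name and return both as plain dicts."""
    # interpolation=None keeps values such as {CFN_USER} untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    template = parser["global"]["cluster_template"]
    cluster = dict(parser[f"cluster {template}"])
    vpc = dict(parser[f"vpc {cluster['vpc_settings']}"])
    return cluster, vpc


if __name__ == "__main__":
    cluster, vpc = resolve_cluster(CONFIG_TEXT)
    for name, section in (("cluster", cluster), ("vpc", vpc)):
        print(f"[{name}]")
        for key, value in section.items():
            print(f"  {key} = {value}")
```

A missing section raises `KeyError`, which makes a mistyped template or settings label visible before any stack is created. The resolved values (for example the two subnet IDs and `use_public_ips`) are the ones to compare against the actual VPC setup.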
3 answers, 0 votes, 36 views. Asked by Stevie 2 years ago.

EFA not working with CentOS 7 and Intel MPI

I have been trying to resolve this problem all week. I followed all the steps to create a two-instance system with EFA drivers installed: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html

I started with the official CentOS 7 HVM image in the Marketplace and updated it before installing the EFA driver. Separately, on different images, I tried both Intel MPI 2019.5 and 2019.6 (installed with the provided aws_impi.sh, which points to the Amazon EFA driver library).

Any test I run hangs with EFA enabled: the jobs start on both instances, but no MPI communication occurs.

    export FI_PROVIDER=efa

Things run fine using the sockets fabric, though.

    export FI_PROVIDER=sockets

I am using c5n.18xlarge instances and an updated CentOS 7 image with kernel revision 3.10.0-1062.4.1.el7.x86_64 (I was also unable to get it working on the previous kernel version, 957). I'm running in a subnet in a VPC with an attached EFS volume. The EFA driver version is 1.4.1. I did not install or test with Open MPI.

The efa_test.sh fails, but it works if I use "-p sockets" instead of "-p efa".

Another test example:

    #!/bin/bash
    module load mpi
    export FI_PROVIDER=efa
    #export FI_PROVIDER=sockets
    mpirun -np 2 -ppn 1 -f $PWD/hosts $PWD/mpi/pt2pt/osu_latency

Additional details:

fi_info, instance 1:

    provider: efa
        fabric: EFA-fe80::a0:1eff:febf:9407
        domain: efa_0-rdm
        version: 2.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::a0:1eff:febf:9407
        domain: efa_0-dgrm
        version: 2.0
        type: FI_EP_DGRAM
        protocol: FI_PROTO_EFA
    provider: efa;ofi_rxd
        fabric: EFA-fe80::a0:1eff:febf:9407
        domain: efa_0-dgrm
        version: 1.0
        type: FI_EP_RDM
        protocol: FI_PROTO_RXD

fi_info, instance 2:

    provider: efa
        fabric: EFA-fe80::70:3bff:fee1:52ad
        domain: efa_0-rdm
        version: 2.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::70:3bff:fee1:52ad
        domain: efa_0-dgrm
        version: 2.0
        type: FI_EP_DGRAM
        protocol: FI_PROTO_EFA
    provider: efa;ofi_rxd
        fabric: EFA-fe80::70:3bff:fee1:52ad
        domain: efa_0-dgrm
        version: 1.0
        type: FI_EP_RDM
        protocol: FI_PROTO_RXD

ibstat:

    CA 'efa_0'
        CA type:
        Number of ports: 1
        Firmware version:
        Hardware version:
        Node GUID: 0x0000000000000000
        System image GUID: 0x0000000000000000
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 0
            LMC: 1
            SM lid: 0
            Capability mask: 0x00000000
            Port GUID: 0x00a01efffebf9407
            Link layer: Unknown

    CA 'efa_0'
        CA type:
        Number of ports: 1
        Firmware version:
        Hardware version:
        Node GUID: 0x0000000000000000
        System image GUID: 0x0000000000000000
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 0
            LMC: 1
            SM lid: 0
            Capability mask: 0x00000000
            Port GUID: 0x00703bfffee152ad
            Link layer: Unknown

Some other errors I have seen but am not familiar with:

    ibsysstat -G 0x00703bfffee152ad
    ibwarn: [24340] mad_rpc_open_port: can't open UMAD port ((null):0)
    ibsysstat: iberror: failed: Failed to open '(null)' port '0'

    ibhosts
    ibwarn: [11769] mad_rpc_open_port: can't open UMAD port ((null):0)
    /codebuild/output/src613026668/src/rdmacore_build/BUILD/rdma-core-25.0/libibnetdisc/ibnetdisc.c:802; can't open MAD port ((null):0)
    /usr/sbin/ibnetdiscover: iberror: failed: discover failed

Any ideas? Thanks - Patrick
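When comparing `fi_info` output across instances, as in this question, a small parser can confirm that each node lists an `efa` provider with an `FI_EP_RDM` endpoint. The sketch below assumes the `key: value` layout quoted in the question; the function names `parse_fi_info` and `has_efa_rdm` are hypothetical, and the sample text is instance 1's output from the post. It only checks that the provider is listed, not why MPI traffic hangs.

```python
# fi_info output for instance 1, copied from the question.
FI_INFO_INSTANCE_1 = """
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
"""


def parse_fi_info(text):
    """Hypothetical helper: split fi_info text into one dict per provider."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only; fabric names contain more colons.
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "provider":
            records.append({})  # each "provider" line starts a new record
        if records:
            records[-1][key] = value
    return records


def has_efa_rdm(records):
    """True if a native efa provider exposes a reliable-datagram endpoint."""
    return any(
        r.get("provider") == "efa" and r.get("type") == "FI_EP_RDM"
        for r in records
    )


if __name__ == "__main__":
    records = parse_fi_info(FI_INFO_INSTANCE_1)
    for r in records:
        print(r["provider"], r["domain"], r["type"])
    print("efa RDM endpoint listed:", has_efa_rdm(records))
```

For the output quoted here the check passes, which suggests libfabric can see the device on that instance and that the hang is more likely to lie elsewhere.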
3 answers, 0 votes, 56 views. Asked 3 years ago.

Failed to set up ParallelCluster on AWS EC2 with Ubuntu OS.

Hi! I am very new to AWS EC2 and AWS ParallelCluster. I have two virtual compute nodes running Ubuntu 18.04 on g3s.xlarge and c5.4xlarge instances. My goal is to set up a parallel cluster (master and slave nodes) with the SLURM job manager so that I can run calculations across multiple nodes using a parallel method such as MPI.

So far, I have tried to create a new cluster with the **pcluster** tool, following the quick manual in README.md in the aws-parallelcluster GitHub repository and the full AWS ParallelCluster manual, but without success. I have also tweaked the config file stored in the $HOME/.parallelcluster folder, and even added the **--norollback** option, but the errors persist.

_My modified config file:_

```
[aws]
aws_region_name = us-east-1

[cluster hpctest]
key_name = XXXXXXXX
base_os = alinux
master_instance_type = g3s.xlarge
master_root_volume_size = 64
compute_instance_type = c5.4xlarge
compute_root_volume_size = 64
initial_queue_size = 0
max_queue_size = 8
maintain_initial_size = false
custom_ami = ami-0fd18b144da8357b7
scheduler = slurm
cluster_type = spot
placement_group = DYNAMIC
placement = cluster
ebs_settings = shared
fsx_settings = fs
vpc_settings = public

[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500

[fsx fs]
shared_dir = /fsx
storage_capacity = 3600

[global]
cluster_template = hpctest
update_check = true
sanity_check = true

[vpc public]
vpc_id = vpc-XXXXXXXX
master_subnet_id = subnet-XXXXXXXX

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[scailing custom]
scaledown_idletime = 1
```

Note: I took the custom_ami ID from <https://github.com/aws/aws-parallelcluster/blob/master/amis.txt>.

_The errors I am facing:_

```
ubuntu@ip-172-XX-XX-XXX:~$ pcluster create t1
Beginning cluster creation for cluster: t1
Creating stack named: parallelcluster-t1
Status: parallelcluster-t1 - ROLLBACK_IN_PROGRESS
Cluster creation failed.
Failed events:
  - AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
  - AWS::EC2::PlacementGroup DynamicPlacementGroup Resource creation cancelled
  - AWS::EC2::EIP MasterEIP Resource creation cancelled
  - AWS::CloudFormation::Stack EBSCfnStack Resource creation cancelled
  - AWS::DynamoDB::Table DynamoDBTable Resource creation cancelled
  - AWS::IAM::Role RootRole API: iam:CreateRole User: arn:aws:iam::3043XXXXXXXX:user/nutt is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::3043XXXXXXXX:role/parallelcluster-t1-RootRole-1L9A3XXXXXXXX
```

Can anyone help me solve this problem? If there are any previous threads asking the same question or describing the same problem, please let me know so that I can start learning from them. Thank you for your time!

Rangsiman
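In a failed-events list like the one in this question, most entries read "Resource creation cancelled", and only one carries an actual reason. The sketch below separates the two and extracts any IAM action named in a "not authorized to perform" message. It parses the text quoted in the question only; the helper names are hypothetical, and the assumption that cancelled entries are side effects of the remaining failure should be checked against the stack events themselves.

```python
import re

# Failed events copied from the question.
PCLUSTER_OUTPUT = """\
Cluster creation failed.
Failed events:
  - AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
  - AWS::EC2::PlacementGroup DynamicPlacementGroup Resource creation cancelled
  - AWS::EC2::EIP MasterEIP Resource creation cancelled
  - AWS::CloudFormation::Stack EBSCfnStack Resource creation cancelled
  - AWS::DynamoDB::Table DynamoDBTable Resource creation cancelled
  - AWS::IAM::Role RootRole API: iam:CreateRole User: arn:aws:iam::3043XXXXXXXX:user/nutt is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::3043XXXXXXXX:role/parallelcluster-t1-RootRole-1L9A3XXXXXXXX
"""

CANCELLED = "Resource creation cancelled"
DENIED = re.compile(r"is not authorized to perform: (\S+)")


def parse_failed_events(text):
    """Hypothetical helper: return (resource_type, logical_id, reason) tuples."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue
        # First two whitespace-separated tokens are type and logical ID;
        # everything after them is the reason.
        parts = line[2:].split(None, 2)
        if len(parts) == 3:
            events.append(tuple(parts))
    return events


def root_causes(events):
    """Keep only events whose reason is something other than a cancellation."""
    return [e for e in events if e[2] != CANCELLED]


def denied_actions(events):
    """Collect IAM actions named in 'not authorized to perform' messages."""
    found = set()
    for _, _, reason in events:
        found.update(DENIED.findall(reason))
    return sorted(found)


if __name__ == "__main__":
    events = parse_failed_events(PCLUSTER_OUTPUT)
    for rtype, logical_id, reason in root_causes(events):
        print(f"{logical_id} ({rtype}): {reason}")
    print("Denied IAM actions:", denied_actions(events))
```

For the output in this question, the only non-cancelled event is `RootRole`, and the denied action is `iam:CreateRole` for the IAM user running `pcluster create`.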
2 answers, 0 votes, 33 views. Asked 3 years ago.