
Questions tagged with High Performance Compute



Job on parallel cluster using hpc6a.48xlarge not running

Hello, I want to run a simulation on a ParallelCluster (alinux2) using 2 hpc6a.48xlarge instances (192 CPUs). I created the cluster and submitted the job with Slurm. The problem is that the job stays pending in the queue and never runs (I left it for more than a day). I ran the same job on another instance type with the same number of CPUs and it worked perfectly, so the issue is specific to hpc6a.48xlarge. I also tried using only 1 hpc6a.48xlarge instance (96 CPUs), but that did not work either. I copy the squeue information at the end of the message. It shows a 'BeginTime' reason, although I have not scheduled my job to start later. What could be the reason for this issue? I am creating the cluster on a new company account. Could the issue be related to the usage of the account? I ask this because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job is executed almost immediately. I would appreciate any advice on resolving this issue.

```
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME    USER     ST TIME NODES NODELIST(REASON)
2     compute   foam-64 ec2-user PD 0:00 1     (BeginTime)
```

Repeated runs of `squeue` show the job staying in state `PD`, with the reason alternating between `(BeginTime)` and `(None)`.
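One thing worth ruling out in this situation (a suggestion, not something from the original post): brand-new accounts often have a very low or zero vCPU service quota for the HPC instance family, in which case ParallelCluster can never launch hpc6a capacity and Slurm simply leaves the job pending. A minimal sketch, assuming boto3 and credentials for the account and region in question, that lists the EC2 quotas whose names mention HPC (for example, an applied quota along the lines of "Running On-Demand HPC instances"):

```python
import boto3

# Hypothetical region; use the region the cluster is deployed in.
quotas = boto3.client("service-quotas", region_name="us-east-1")

# Walk the EC2 quotas and print the ones whose names mention HPC.
# A 0-vCPU HPC quota on a new account would keep hpc6a nodes from ever
# launching while Slurm leaves the job in the pending state.
token = None
while True:
    kwargs = {"ServiceCode": "ec2", "MaxResults": 100}
    if token:
        kwargs["NextToken"] = token
    page = quotas.list_service_quotas(**kwargs)
    for quota in page["Quotas"]:
        if "HPC" in quota["QuotaName"].upper():
            print(f"{quota['QuotaName']}: {quota['Value']}")
    token = page.get("NextToken")
    if not token:
        break
```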
4
answers
0
votes
5
views
AWS-User-9531109
asked 2 months ago

Trying Sagemaker example but getting error: AttributeError: module 'sagemaker' has no attribute 'create_transform_job'

Hi, I keep getting this error: `AttributeError: module 'sagemaker' has no attribute 'create_transform_job'` when running a batch transform example that AWS graciously provides in the notebook instances. I also updated SageMaker to the newest package and it's still not working. Code:

```
%%time
import time
from time import gmtime, strftime

batch_job_name = "Batch-Transform-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file
)  # use input data without ID column
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, batch_job_name)

request = {
    "TransformJobName": batch_job_name,
    "ModelName": 'xgboost-parquet-example-training-2022-03-28-16-02-31-model',
    "TransformOutput": {
        "S3OutputPath": output_location,
        "Accept": "text/csv",
        "AssembleWith": "Line",
    },
    "TransformInput": {
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_location}},
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None",
    },
    "TransformResources": {"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
}

sagemaker.create_transform_job(**request)
print("Created Transform job with name: ", batch_job_name)

# Wait until the job finishes
try:
    sagemaker.get_waiter("transform_job_completed_or_stopped").wait(TransformJobName=batch_job_name)
finally:
    response = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = response["TransformJobStatus"]
    print("Transform job ended with status: " + status)
    if status == "Failed":
        message = response["FailureReason"]
        print("Transform failed with the following error: {}".format(message))
        raise Exception("Transform job failed")
```

Everything else is working well. I've had no luck with this on any other forum.
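For context (not part of the original question): `create_transform_job`, `get_waiter`, and `describe_transform_job` appear to be methods of the low-level boto3 SageMaker client rather than of the `sagemaker` Python SDK module, which is why the attribute lookup fails. A minimal sketch of the same call pattern against the boto3 client, assuming the `request` dictionary and `batch_job_name` variable from the snippet above:

```python
import boto3

# The low-level SageMaker API client exposes create_transform_job,
# describe_transform_job, and the transform-job waiters.
sm_client = boto3.client("sagemaker")

# `request` and `batch_job_name` are the variables built in the question's snippet.
sm_client.create_transform_job(**request)
print("Created Transform job with name:", batch_job_name)

try:
    waiter = sm_client.get_waiter("transform_job_completed_or_stopped")
    waiter.wait(TransformJobName=batch_job_name)
finally:
    response = sm_client.describe_transform_job(TransformJobName=batch_job_name)
    print("Transform job ended with status:", response["TransformJobStatus"])
```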
1
answers
0
votes
9
views
AWS-User-7732475
asked 2 months ago

Setting MKL_NUM_THREADS to be more than 16 for m5 instances

Hey, I have a 32-core EC2 Linux m5 instance. My Python is installed via Anaconda. I notice that my numpy cannot use more than 16 cores. It looks like my numpy uses libmkl_rt.so:

```
In [2]: np.show_config()
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/ec2-user/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ec2-user/anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/ec2-user/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ec2-user/anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/ec2-user/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ec2-user/anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/ec2-user/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ec2-user/anaconda3/include']
```

When I set MKL_NUM_THREADS below 16, it works:

```
(base) ec2-user@ip-172-31-18-3:~$ export MKL_NUM_THREADS=12 && python -c "import ctypes; mkl_rt = ctypes.CDLL('libmkl_rt.so'); print (mkl_rt.mkl_get_max_threads())"
12
```

When I set it to 24, it stops at 16:

```
(base) ec2-user@ip-172-31-18-3:~$ export MKL_NUM_THREADS=24 && python -c "import ctypes; mkl_rt = ctypes.CDLL('libmkl_rt.so'); print (mkl_rt.mkl_get_max_threads())"
16
```

But I do have 32 cores:

```
In [2]: os.cpu_count()
Out[2]: 32
```

Are there any other settings I need to check? Thanks, Bill
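A side note (an assumption based on the numbers in the question, not from the original post): MKL by default caps its thread pool at the number of physical cores, and a 32-vCPU m5 instance with hyperthreading has 16 physical cores, which matches the observed ceiling of 16; the output above suggests requests beyond that are clamped, and oversubscribing hyperthreads rarely helps BLAS workloads anyway. A small sketch that compares logical and physical core counts and sets the MKL thread count programmatically (psutil ships with Anaconda; the pointer-style `mkl_set_num_threads` call uses MKL's Fortran-style entry point):

```python
import ctypes
import os

import psutil  # bundled with Anaconda distributions

logical = os.cpu_count()                    # e.g. 32 vCPUs with hyperthreading
physical = psutil.cpu_count(logical=False)  # e.g. 16 physical cores -> MKL's default cap

print(f"logical CPUs: {logical}, physical cores: {physical}")

mkl_rt = ctypes.CDLL("libmkl_rt.so")

# Ask MKL for one thread per physical core and read back the effective limit.
mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(physical)))
print("MKL max threads:", mkl_rt.mkl_get_max_threads())
```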
3
answers
0
votes
4
views
AWS-User-2549579
asked 3 months ago

How to run JVM based HFT application on Graviton 3 CPUs

I am thinking of creating a high-frequency trading system to test my algo trading strategies. The infrastructure will be an AWS EC2 Graviton 3 instance (C7g) + Amazon Linux 2 + the Amazon Corretto JVM runtime OR an OpenJDK GraalVM distribution. Why GraalVM? Because it is polyglot and reduces the context switching (marshaling/unmarshaling) between data structures of different programming languages. The EC2 Graviton 3 instances come with many virtual cores and a large amount of L1 + L2 + L3 cache. Pre-compiled native ARM CPU instructions will be saved in the code cache. Data analytics will be done with Apache Spark 3 and the code will be JIT-aware (mostly in Scala and R). Data will be populated from SSDs and processed in RAM.

Questions:

1. Is Amazon Corretto or GraalVM capable of generating and interpreting executable native instructions for the ARM-based Graviton CPU?
2. Amazon Corretto is a flavor of OpenJDK. Has Project GraalVM already been merged into the Amazon Corretto JVM? Can I replace the C2 compiler of the Amazon Corretto JVM with **Java on Truffle** (a meta-circular JIT)?
3. Where can I find guides or whitepapers on which OpenJDK JEPs and projects Amazon Corretto supports?
4. Which extremely fast programming language should I choose to write my algo trading business logic? I am expecting nanosecond-scale latency from the time a signal enters the Ethernet port until the result is returned.

Better suggestions and questions are always appreciated.
0
answers
0
votes
4
views
fasil
asked 3 months ago

DLAMI does not have CUDA/NVIDIA (and cannot access cuda from pytorch)

I am running on Deep Learning AMI (Ubuntu 18.04) Version 56.0 - ami-083abc80c473f5d88, but I have tried several similar DLAMIs. I am unable to access CUDA from PyTorch to train my models. See here:

```
$ apt list --installed | grep -i "nvidia"
```

```
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-compute-460-server/bionic-updates,bionic-security,now 460.106.00-0ubuntu0.18.04.2 amd64 [installed,automatic]
libnvidia-container-tools/bionic,now 1.7.0-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.7.0-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.7.0-1 amd64 [installed]
nvidia-cuda-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-cuda-doc/bionic,now 9.1.85-3ubuntu1 all [installed,automatic]
nvidia-cuda-gdb/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1 amd64 [installed]
nvidia-docker2/bionic,now 2.8.0-1 all [installed]
nvidia-fabricmanager-450/now 450.142.00-1 amd64 [installed,upgradable to: 450.156.00-0ubuntu0.18.04.1]
nvidia-opencl-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-profiler/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-visual-profiler/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
```

So it shows I have NVIDIA packages. However, when I run Python:

```
~$ bpython
bpython version 0.22.1 on top of Python 3.8.12 /home/ubuntu/anaconda3/envs/pytorch_p38/bin/python3.8
>>> import torch.nn as nn
>>> import torch
>>> torch.cuda.is_available()
False
```

Even after I re-install the NVIDIA driver:

```
sudo apt install nvidia-driver-455
```

I get this:

```
(pytorch_p38) ubuntu@ip-172-31-95-17:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
(pytorch_p38) ubuntu@ip-172-31-95-17:~$ bpython
bpython version 0.22.1 on top of Python 3.8.12 /home/ubuntu/anaconda3/envs/pytorch_p38/bin/python3.8
>>> import torch
>>> torch.cuda.is_available()
False
```

Does anyone know how to get PyTorch to access CUDA? Any help is greatly appreciated.
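A small diagnostic sketch related to the question above (not from the original post, and assuming the instance type actually has a GPU, e.g. a g4dn or p3 instance): it prints the CUDA version the installed PyTorch build was compiled against and whether the NVIDIA kernel driver answers `nvidia-smi`. If `nvidia-smi` is missing or errors out, the problem is the driver or the instance type rather than PyTorch itself.

```python
import shutil
import subprocess

import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)   # None means a CPU-only build
print("cuda available:", torch.cuda.is_available())

# nvidia-smi talks to the kernel driver; if it fails, no framework on the
# instance will be able to see the GPU.
if shutil.which("nvidia-smi") is None:
    print("nvidia-smi not found - NVIDIA driver likely not installed")
else:
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout or result.stderr)
```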
0
answers
0
votes
3
views
AWS-User-9348549
asked 4 months ago

Creating custom YAML files for AWS Parallel Cluster

I am trying to follow the tutorial for running FDS/SMV on AWS ParallelCluster here: https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/ . I get to the point where it asks me to set up a config file with the following data:

```
[aws]
aws_region_name = <AWS-REGION>

[global]
sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
compute_instance_type=c5n.18xlarge
master_instance_type=c5.xlarge
initial_queue_size = 0
max_queue_size = 100
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::fds-smv-bucket-unique*
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
```

I am unable to create a YAML file that will be accepted by ParallelCluster's create-cluster. It returns the error:

```
{
  "message": "Bad Request: Configuration must be a valid YAML document"
}
```

I attempted to create a YAML file using the AWS ParallelCluster configure wizard (https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-configuring.html), but it does not cover all the specifications the tutorial asks for, like a shared S3 bucket for FSx. I tried starting with the wizard-created configuration file and editing it so it looks like the YAML files in the documentation, but it still returns the same error. See my edited YAML file here:

```
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-032f3e6409362aff2
  Ssh:
    KeyName: MyKeyPair1
  DisableSimultaneousMultithreading: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: queue1
    CapacityType: ONDEMAND
    ComputeResources:
    - Name: c5n18xlarge
      InstanceType: c5n.18xlarge
      MinCount: 0
      MaxCount: 100
      Efa:
        Enabled: true
    Networking:
      SubnetIds:
      - subnet-032f3e6409362aff2
    Iam:
      S3Access:
      - BucketName: fds-smv-jts-bucket-1
        EnableWriteAccess: True
      AdditionalIamPolicies:
      - Policy: arn:aws:s3:::fds-smv-jts-bucket-1*
SharedStorage:
- MountDir: /fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    ImportedFileChunkSize: 1024
    ExportPath: s3://fds-smv-jts-bucket-1
    ImportPath: s3://fds-smv-jts-bucket-1
Tags:
- Key: String
  Value: fds-smv
DevSettings ClusterTemplate: fds-smv-cluster
```

Any ideas on how to create the proper YAML file with all the data that is requested for the tutorial? Thank you!
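One low-tech way to chase a "Configuration must be a valid YAML document" rejection (a suggestion, not from the original post): run the config file through a YAML parser locally, which reports the exact line and column of the first syntax problem before you ever call `pcluster create-cluster`. A minimal sketch, assuming PyYAML is installed and the config is saved under a hypothetical name `cluster-config.yaml`:

```python
import sys

import yaml  # PyYAML

# Hypothetical file name; point this at the file passed to `pcluster create-cluster`.
path = "cluster-config.yaml"

try:
    with open(path) as f:
        doc = yaml.safe_load(f)
except yaml.YAMLError as exc:
    # Most PyYAML syntax errors carry a problem_mark with line/column info.
    mark = getattr(exc, "problem_mark", None)
    if mark is not None:
        print(f"YAML error at line {mark.line + 1}, column {mark.column + 1}: {exc}")
    else:
        print(f"YAML error: {exc}")
    sys.exit(1)

sections = sorted(doc) if isinstance(doc, dict) else doc
print("Parsed OK; top-level sections:", sections)
```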
1
answers
0
votes
3
views
Amazon-User-25
asked 4 months ago

Problems building 'Hello World CL VHDL Example'

I wanted to build this example on my EC2 instance (which was prepared from the AWS FPGA Developer AL2 AMI):

* https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_hello_world_vhdl

I was following the steps from this page:

* https://github.com/aws/aws-fpga/blob/master/hdk/README.md#endtoend

But I ran into issues in the build output looking like this:

```
ERROR: [Common 17-141] Failed to write file content of top_sp.xbdc in zip archive.
...
...
Abnormal program termination (11)
Please check '/home/ec2-user/src/project_data/aws-fpga/hdk/cl/examples/cl_hello_world_vhdl/build/scripts/hs_err_pid11287.log' for details
```

If I take a look at the suggested `hs_err_pid11287.log` file, it contains this:

```
Stack:
/lib64/libc.so.6(+0x33c90) [0x7ffa2444bc90]
/opt/Xilinx/Vivado/2021.1/lib/lnx64.o/libtcmalloc.so.4(tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0xf3) [0x7ffa263a8973]
```

I noticed there were a few other forum threads with similar errors:

* https://forums.aws.amazon.com/thread.jspa?threadID=256609
* https://forums.aws.amazon.com/thread.jspa?threadID=261665&tstart=0

I saw a mention of trying to adjust the value of a `resource_sharing` parameter, but the thread didn't mention where to tweak such a setting. Hunting around, I thought perhaps I needed to edit the `$HDK_DIR/common/shell_v04261818/build/scripts/strategy_TIMING.tcl` file, change a line in there from `-resource_sharing off` to `-resource_sharing auto`, and then re-build with `./aws_build_dcp_from_cl.sh -strategy TIMING`. But no luck: I got a similar error to the one above, though this time the 'Failed to write' error referred to a `top_sp.edf` file rather than the `top_sp.xbdc` file mentioned earlier. Wondering what else I can try?
4
answers
0
votes
8
views
Gurce
asked 4 months ago

unknown slowdown in parallelcluster

I've noticed that the amount of time to complete the jobs in my task array varies dramatically. Any idea what is causing it? The speed of the jobs seems very acceptable for the first jobs, but then something goes wrong. I'm using the Slurm scheduler 20.11.8 and AWS ParallelCluster 3.0.2.

Below are 2 examples showing the variation in time per job. I plot the time (in seconds) it takes for each job/task (each job is a dot). (I couldn't see how to attach an image, so I'm providing links.)

Example 1: 800 jobs, [https://ibb.co/KrrwhXn](https://ibb.co/KrrwhXn) — you can see that the first ~400 tasks complete in roughly 400 seconds per job, and then jobs 400 to 750 take about 6000 seconds.

Example 2: 300 jobs, [https://ibb.co/4RdTpzg](https://ibb.co/4RdTpzg) — you can see that the first 50 jobs run slower than jobs 50-150, and then jobs 150-200 are slowest.

In both cases I'm running 50 nodes at a time. It seems like the duration of a job is related to the number of jobs each instance has run; in other words, the speed of the tasks often changes considerably at each multiple of 50. When I change the number of nodes running at a time, I still observe this pattern. Each job is basically equal in the amount of "work" there is to do (within 5%), so it's *not* the case, for example, that jobs 150-200 in example 2 are "harder" than the other jobs. Actually, the 2 examples above are the exact same jobs (but in example 2 I only ran the first 300 of the 800 jobs).

What I've tried:

1. I've used different instance types, but I observe this slowdown across all instance types.
2. I've used different numbers of nodes, but whether I use 20, 40, or 50, I observe this slowdown.
3. I've observed the CPU and memory usage on both the head node and the nodes in the compute fleet, and it seems reasonable. When I use `top` to monitor, the highest-usage process is generally using less than 1% of memory and 1% of CPU.
4. I've explored these logs on the **head** node, but I haven't found anything that's clearly wrong:
   * /var/log/cfn-init.log
   * /var/log/chef-client.log
   * /var/log/parallelcluster/slurm_resume.log
   * /var/log/parallelcluster/slurm_suspend.log
   * /var/log/parallelcluster/clustermgtd
   * /var/log/slurmctld.log
5. I've explored these logs on the **compute** node, but I haven't found anything that's clearly wrong:
   * /var/log/cloud-init-output.log
   * /var/log/parallelcluster/computemgtd
   * /var/log/slurmd.log

Here's my configuration file:

```
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bucket/head.sh
  InstanceType: t2.medium
  Networking:
    SubnetId: [snip]
  Ssh:
    KeyName: [snip]
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: t2medium
      InstanceType: t2.medium
      MinCount: 0
      MaxCount: 101
    Networking:
      SubnetIds:
      - subnet-[snip]
    CustomActions:
      OnNodeConfigured:
        Script: s3://my-bucket/node.sh
```

I'm limiting the number of nodes running (50) in the following way:

```
#!/bin/sh
#SBATCH --partition queue1
#SBATCH --array=1-800%50
#SBATCH --nice=100
```
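An aside worth checking, offered as a hypothesis rather than a diagnosis from the original post: the compute resource here is t2.medium, a burstable instance type, and an exhausted CPU-credit balance would produce exactly this fast-then-slow pattern as each batch of nodes burns through its initial credits. A hedged sketch, assuming boto3 credentials and a hypothetical compute-node instance ID, that pulls the CPUCreditBalance metric from CloudWatch:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Hypothetical instance ID of one of the compute nodes in the fleet.
instance_id = "i-0123456789abcdef0"

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",   # only emitted for burstable (t2/t3) instances
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

# A balance that drains to near zero right when jobs slow down would point
# at CPU-credit throttling rather than the scheduler.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```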
3
answers
0
votes
6
views
AWS-User-1494532
asked 5 months ago

Can't start up an EC2 with F1 instance-type

I've been wanting to explore FPGA development via the AWS FPGA Developer AMI on the marketplace. While following a tutorial, there came a point where I needed the instance to have a `/dev/xfpga` device on it, which I believe exists only on the `F1` instance-types. So I turned off my EC2 instance and tried switching the instance-type to `f1.2xlarge`. It seemed to accept the change of instance-type fine at this point. But when I tried to right-click and 'Start Instance', I was met with an error message of:

```
Failed to start the instance i-056d8b0407f711785
The requested configuration is currently not supported. Please check the documentation for supported configurations.
```

Googling online, a few possible reasons were suggested. The one I suspect is most likely is that the F1 instance-types aren't supported in my local region (Sydney / ap-southeast-2) yet. I do recall reading an article online that mentioned F1 instance-types only being available in US regions:

- https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-ec2-f1-instances-are-now-available-in-aws-govcloud--us/

But that article is from 2017, and I assumed F1 instance-types would be globally available by now; since nothing stopped me from selecting the type while in the Sydney region, that added to the impression. But anyway, I thought I'd ask here in case anyone has any insights into this. For my part, I will try to spin up an EC2 instance in a US region and see if I have any luck there.
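A quick way to confirm the regional-availability suspicion (a sketch offered here, not part of the original post): ask the EC2 API which Availability Zones, if any, offer `f1.2xlarge` in the region. Assuming boto3 and credentials:

```python
import boto3

# Check whether f1.2xlarge is offered anywhere in the Sydney region.
ec2 = boto3.client("ec2", region_name="ap-southeast-2")

offerings = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type", "Values": ["f1.2xlarge"]}],
)

if offerings["InstanceTypeOfferings"]:
    for offering in offerings["InstanceTypeOfferings"]:
        print("offered in", offering["Location"])
else:
    print("f1.2xlarge is not offered in ap-southeast-2")
```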
1
answers
0
votes
7
views
Gurce
asked 5 months ago

AWS Parallel cluster compute nodes failing to start properly

Hello, I am a new ParallelCluster 2.11 user and am having an issue where my compute nodes fail to spin up properly, resulting in the eventual failure of pcluster create. Here is my config file (note: I replaced square brackets with curly braces because the AWS forums recognize square brackets as links):

```
{aws}
aws_region_name = us-east-1

{aliases}
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

{global}
cluster_template = default
update_check = true
sanity_check = true

{cluster default}
key_name = <my-keypair>
scheduler = slurm
master_instance_type = c5n.2xlarge
base_os = centos7
vpc_settings = default
queue_settings = compute
master_root_volume_size = 1000
compute_root_volume_size = 35

{vpc default}
vpc_id = <my-default-vpc-id>
master_subnet_id = <my-subneta>
compute_subnet_id = <my-subnetb>
use_public_ips = false

{queue compute}
enable_efa = true
compute_resource_settings = default
compute_type = ondemand
placement_group = DYNAMIC
disable_hyperthreading = true

{compute_resource default}
instance_type = c5n.18xlarge
initial_count = 1
min_count = 1
max_count = 32

{ebs shared}
shared_dir = shared
volume_type = st1
volume_size = 500
```

When I run pcluster create I get the following error after ~15 min:

```
The following resource(s) failed to create: MasterServer.
- AWS::EC2::Instance MasterServer Failed to receive 1 resource signal(s) within the specified duration
```

If I log into the master node before the failure above, I see the following in the /var/log/parallelcluster/clustermgtd log file:

```
2021-09-28 15:42:41,168 - slurm_plugin.clustermgtd:_maintain_nodes - INFO - Found the following unhealthy static nodes: (x1) 'compute-st-c5n18xlarge-1(compute-st-c5n18xlarge-1)'
2021-09-28 15:42:41,168 - slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes - INFO - Setting unhealthy static nodes to DOWN
```

However, despite setting the node to DOWN, the EC2 compute instance continues to stay in the running state and the above log continually emits the following message:

```
2021-09-28 15:54:41,156 - slurm_plugin.clustermgtd:_maintain_nodes - INFO - Following nodes are currently in replacement: (x1) 'compute-st-c5n18xlarge-1'
```

This state persists until the pcluster create command fails with the error noted above. I suspect there is something wrong with my configuration -- any help or further troubleshooting advice would be appreciated.

Edited by: notknottheory on Sep 28, 2021 9:19 AM
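One possibility to rule out here, offered as a hypothesis rather than something stated in the original post: with `use_public_ips = false`, both subnets need an outbound route to the internet (typically via a NAT gateway) for the nodes to bootstrap and report healthy, and a missing route is a common reason static compute nodes show up as unhealthy. A small sketch, assuming boto3 and the real subnet IDs in place of the placeholders below, that prints each subnet's default route:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical placeholders for <my-subneta> / <my-subnetb> from the config.
subnet_ids = ["subnet-aaaaaaaa", "subnet-bbbbbbbb"]

for subnet_id in subnet_ids:
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    if not tables:
        print(subnet_id, "-> uses the VPC main route table (check it separately)")
        continue
    for table in tables:
        for route in table["Routes"]:
            if route.get("DestinationCidrBlock") == "0.0.0.0/0":
                target = route.get("NatGatewayId") or route.get("GatewayId") or "none"
                print(subnet_id, "-> default route via", target)
```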
2
answers
0
votes
0
views
notknottheory
asked 8 months ago

How to update cluster config when the original ebs snapshot is gone

Hi, I have a cluster configured with ParallelCluster 2.10 that has been running for over half a year now. It has two EBS resources mounted, /shared and /install. It seems that both of the EBS snapshots associated with the mount points have been deleted. This should not be an issue, since the snapshots are used only for the initialization of the cluster; however, when I try to update the configuration of the cluster now - simply adding some compute nodes (bumping the max_queue_size) - I am facing the following error message:

```
(venv_aws) > pcluster update flacscloudHPC-2-10-0 -c ./config_flacscloudHPC
Retrieving configuration from CloudFormation for cluster flacscloudHPC-2-10-0...
Validating configuration file ./config_flacscloudHPC...
WARNING: The configuration parameter 'scheduler' generated the following warnings:
The job scheduler you are using (torque) is scheduled to be deprecated in future releases of ParallelCluster. More information is available here: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster
ERROR: The section [ebs custom2] is wrongly configured
The snapshot snap-0870f8601759ca239 does not appear to exist: The snapshot 'snap-0870f8601759ca239' does not exist.
```

How can I update the max_queue_size without having the original snapshot 'snap-0870f8601759ca239'? Is it safe to forcefully reconfigure the cluster with some updated, existing snapshots?
2
answers
0
votes
0
views
mfolusiak1
asked 8 months ago

torque service exits with status 3 on master node

Hi, I noticed strange behavior in my cluster. I am using Torque on CentOS 8. The cluster was working fine for over 2 months, then all of a sudden the compute nodes stopped running queued jobs. I tried restarting the compute fleet but this didn't help, and I found out that the Torque service on the master node had failed and I am not able to restart it (see listing below). What can I do to repair my cluster? I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading).

```
[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
  Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
  Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
  Docs: man:munged(8)
  Main PID: 5974 (munged)
  Tasks: 4 (limit: 47239)
  Memory: 3.5M
  CGroup: /system.slice/munge.service
          └─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
  Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
  Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
  Docs: man:systemd-sysv-generator(8)
  Tasks: 0 (limit: 47239)
  Memory: 0B
  CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
  Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
  Active: failed (Result: exit-code) since Tue 2020-12-01 22:04:04 UTC; 2 months 17 days ago
  Main PID: 6173 (code=exited, status=3)
jobwatcher    RUNNING   pid 6467, uptime 78 days, 17:53:12
sqswatcher    RUNNING   pid 6468, uptime 78 days, 17:53:11
● trqauthd.service - TORQUE trqauthd daemon
  Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
  Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
  Main PID: 5900 (trqauthd)
  Tasks: 1 (limit: 47239)
  Memory: 3.1M
  CGroup: /system.slice/trqauthd.service
          └─5900 /opt/torque/sbin/trqauthd -F

[centos@ip-172-31-24-41 ~]$ sudo service pbs_server restart
Restarting pbs_server (via systemctl):  [  OK  ]

[centos@ip-172-31-24-41 ~]$ sudo service --status-all
Usage: /etc/init.d/ec2blkdev {start|stop}
● munge.service - MUNGE authentication service
  Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
  Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
  Docs: man:munged(8)
  Main PID: 5974 (munged)
  Tasks: 4 (limit: 47239)
  Memory: 3.5M
  CGroup: /system.slice/munge.service
          └─5974 /usr/sbin/munged
● pbs_sched.service - SYSV: PBS is a batch versatile batch system for SMPs and clusters
  Loaded: loaded (/etc/rc.d/init.d/pbs_sched; generated)
  Active: active (exited) since Tue 2020-12-01 22:04:05 UTC; 2 months 17 days ago
  Docs: man:systemd-sysv-generator(8)
  Tasks: 0 (limit: 47239)
  Memory: 0B
  CGroup: /system.slice/pbs_sched.service
● pbs_server.service - TORQUE pbs_server daemon
  Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
  Active: failed (Result: exit-code) since Thu 2021-02-18 16:18:48 UTC; 7s ago
  Process: 2884631 ExecStart=/opt/torque/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, status=3)
  Main PID: 2884631 (code=exited, status=3)
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: Started TORQUE pbs_server daemon.
Feb 18 16:18:48 ip-172-31-24-41 pbs_server[2884631]: pbs_server port already bound: Address already in use
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 18 16:18:48 ip-172-31-24-41 systemd[1]: pbs_server.service: Failed with result 'exit-code'.
jobwatcher    RUNNING   pid 6467, uptime 78 days, 18:14:44
sqswatcher    RUNNING   pid 6468, uptime 78 days, 18:14:43
● trqauthd.service - TORQUE trqauthd daemon
  Loaded: loaded (/usr/lib/systemd/system/trqauthd.service; enabled; vendor preset: disabled)
  Active: active (running) since Tue 2020-12-01 22:03:49 UTC; 2 months 17 days ago
  Main PID: 5900 (trqauthd)
  Tasks: 1 (limit: 47239)
  Memory: 3.1M
  CGroup: /system.slice/trqauthd.service
          └─5900 /opt/torque/sbin/trqauthd -F
```
2
answers
0
votes
0
views
mfolusiak
asked a year ago

torque nodes overloaded with TSK greater than NP

Hello, I noticed that nodes in my cluster tend to overcommit and become overloaded, running more Torque jobs than the number of available CPUs. I suspect it may be related to the Torque configuration (or maybe it doesn't respect hyperthreading somehow?). I am using ParallelCluster 2.10 with a custom AMI and a maximum of 12 nodes with 8 processors each (c5.4xlarge without hyperthreading). The node I am analyzing here is **ip-172-31-68-184**.

This is the qnodes output for this node, which should allow up to np=8 CPUs:

```
$ qnodes
...
ip-172-31-68-184
  state = free
  power_state = Running
  np = 8
  ntype = cluster
  jobs = 0/218.ip-172-31-24-41.eu-central-1.compute.internal,1/219.ip-172-31-24-41.eu-central-1.compute.internal,2/220.ip-172-31-24-41.eu-central-1.compute.internal,3/221.ip-172-31-24-41.eu-central-1.compute.internal,4/518.ip-172-31-24-41.eu-central-1.compute.internal
  status = opsys=linux,uname=Linux ip-172-31-68-184 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020 x86_64,sessions=1182 1306 5674 6030 6039 6046 6062 112846,nsessions=8,nusers=4,idletime=166759,totmem=31720500kb,availmem=29305472kb,physmem=31720500kb,ncpus=8,loadave=18.33,gres=,netload=47638299866,state=free,varattr= ,cpuclock=Fixed,macaddr=02:5a:f2:25:37:ba,version=6.1.2,rectime=1612984963,jobs=218.ip-172-31-24-41.eu-central-1.compute.internal 219.ip-172-31-24-41.eu-central-1.compute.internal 220.ip-172-31-24-41.eu-central-1.compute.internal 221.ip-172-31-24-41.eu-central-1.compute.internal 518.ip-172-31-24-41.eu-central-1.compute.internal
  mom_service_port = 15002
  mom_manager_port = 15003
```

whereas the qstat output for this node is:

```
Job ID                  Username   Queue  Jobname  SessID  NDS  TSK  Memory  Time      S  Time
218.ip-172-31-24-41.eu  flacscloud batch  000038   6030    --   4    --      48:00:00  R  46:13:51  ip-172-31-68-184/0
219.ip-172-31-24-41.eu  flacscloud batch  000039   6039    --   4    --      48:00:00  R  46:13:51  ip-172-31-68-184/1
220.ip-172-31-24-41.eu  flacscloud batch  000056   6046    --   4    --      48:00:00  R  46:13:51  ip-172-31-68-184/2
221.ip-172-31-24-41.eu  flacscloud batch  000060   6062    --   4    --      48:00:00  R  46:13:51  ip-172-31-68-184/3
518.ip-172-31-24-41.eu  flacscloud batch  012310   112846  --   2    --      48:00:00  R  23:16:18  ip-172-31-68-184/4
```

It is clear that the sum of TSK for the running jobs is greater than the number of CPUs. This observation can be confirmed by running `top` on the node; the node is overloaded. Why would that happen and how can I fix this behavior?

Edited by: mfolusiak on Feb 10, 2021 12:03 PM
Edited by: mfolusiak on Feb 10, 2021 1:09 PM
3
answers
0
votes
0
views
mfolusiak
asked a year ago

"pcluster create" results in error during startup - How to address ?

**I am attempting to create a small ParallelCluster using OSX and Anaconda. I installed and configured ParallelCluster with no errors. The resulting simple config file looks like:**

```
[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = newmac
base_os = ubuntu1604
scheduler = sge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-058cb4b123bf54848
master_subnet_id = subnet-04998fb2f4bc80ccc
compute_subnet_id = subnet-03c7e26fb2ea81430
use_public_ips = false
```

**When I went to create the cluster, here is what I see:**

```
(base) esteban$ pcluster create -c /Users/esteban/.parallelcluster/config first-cluster
Beginning cluster creation for cluster: first-cluster
WARNING: The configuration parameter 'scheduler' generated the following warnings:
The job scheduler you are using (sge) is scheduled to be deprecated in future releases of ParallelCluster. More information is available here: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster
Creating stack named: parallelcluster-first-cluster
Status: parallelcluster-first-cluster - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
- AWS::AutoScaling::AutoScalingGroup ComputeFleet Resource creation cancelled
- AWS::CloudFormation::WaitCondition MasterServerWaitCondition Received FAILURE signal with UniqueId i-0438bb35816b58aa5
```

**Okay, so I can still log in, although there are no SGE binaries installed:**

```
$ pcluster ssh first-cluster -i ~/.ssh/newmac.pem
```

**So I just deleted the cluster, which appears to have (I hope) cleaned up all the associated services:**

```
pcluster delete -c /Users/esteban/.parallelcluster/config first-cluster
```

Any ideas? Thanks

Edited by: Stevie on Jul 23, 2020 11:42 AM
Edited by: Stevie on Jul 23, 2020 11:46 AM
Edited by: Stevie on Jul 23, 2020 11:47 AM
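One way to get past the generic "Received FAILURE signal" message (a suggestion, not from the original post): the real cause usually shows up in the CloudFormation stack events and in /var/log/cfn-init.log on the master node (creating with `--norollback` keeps the stack and instance around for inspection). A minimal sketch, assuming boto3 credentials and the stack name ParallelCluster generated above, that prints the failed events:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Stack name as created by `pcluster create first-cluster`.
stack_name = "parallelcluster-first-cluster"

# Print only the failed events and their reasons (requires the stack to
# still exist, e.g. when created with --norollback).
events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
for event in events:
    if "FAILED" in event["ResourceStatus"]:
        print(
            event["LogicalResourceId"],
            event["ResourceStatus"],
            event.get("ResourceStatusReason", ""),
        )
```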
3
answers
0
votes
0
views
Stevie
asked 2 years ago

EFA not working with CentOS 7 and Intel MPI

I have been trying to resolve this problem all week. I followed all the steps to create a 2-instance system with the EFA drivers installed: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html

I started with the official CentOS 7 HVM image in the marketplace and updated it before installing the EFA driver. Separately, and on different images, I have tried both Intel MPI 2019.5 and 2019.6 (installed with the provided aws_impi.sh that points to the Amazon EFA driver library). Any test I run hangs with EFA enabled; the jobs start on both instances but no MPI communication occurs:

```
export FI_PROVIDER=efa
```

Things run fine using the sockets fabric though:

```
export FI_PROVIDER=sockets
```

Setup details:

* c5n.18xlarge instances
* an updated CentOS 7 image with kernel revision 3.10.0-1062.4.1.el7.x86_64 (I was also unable to get it working on the previous kernel version, 957)
* running in a subnet in a VPC with an attached EFS volume
* EFA driver version 1.4.1
* I did not install or test with Open MPI

The efa_test.sh fails, but it works if I use "-p sockets" instead of "-p efa". Another test example:

```
#!/bin/bash
module load mpi
export FI_PROVIDER=efa
#export FI_PROVIDER=sockets
mpirun -np 2 -ppn 1 -f $PWD/hosts $PWD/mpi/pt2pt/osu_latency
```

Additional details. fi_info on instance 1:

```
provider: efa
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
```

fi_info on instance 2:

```
provider: efa
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
```

ibstat:

```
CA 'efa_0'
  CA type:
  Number of ports: 1
  Firmware version:
  Hardware version:
  Node GUID: 0x0000000000000000
  System image GUID: 0x0000000000000000
  Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 100
    Base lid: 0
    LMC: 1
    SM lid: 0
    Capability mask: 0x00000000
    Port GUID: 0x00a01efffebf9407
    Link layer: Unknown
CA 'efa_0'
  CA type:
  Number of ports: 1
  Firmware version:
  Hardware version:
  Node GUID: 0x0000000000000000
  System image GUID: 0x0000000000000000
  Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 100
    Base lid: 0
    LMC: 1
    SM lid: 0
    Capability mask: 0x00000000
    Port GUID: 0x00703bfffee152ad
    Link layer: Unknown
```

Some other errors I have seen but am not familiar with:

```
ibsysstat -G 0x00703bfffee152ad
ibwarn: [24340] mad_rpc_open_port: can't open UMAD port ((null):0)
ibsysstat: iberror: failed: Failed to open '(null)' port '0'

ibhosts
ibwarn: [11769] mad_rpc_open_port: can't open UMAD port ((null):0)
/codebuild/output/src613026668/src/rdmacore_build/BUILD/rdma-core-25.0/libibnetdisc/ibnetdisc.c:802; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
```

Any ideas? Thanks - Patrick
3
answers
0
votes
0
views
patrickt
asked 3 years ago

Failed to set up parallel cluster on AWS EC2 with Ubuntu OS.

Hi! I am very new to AWS EC2 and AWS ParallelCluster. I have two virtual compute nodes running on Ubuntu 18.04 with g3s.xlarge and c5.4xlarge instances. My goal is to set up a parallel cluster (master & slave nodes) with the SLURM job manager for running calculations on multiple nodes with a parallel method like MPI. So far, I have tried to create a new parallel cluster using the **pcluster** tool by learning from the quick manual in the README.md of the aws-parallelcluster GitHub repository and from the full AWS ParallelCluster manual, but I failed to do so. I have also tweaked the config file which is stored in the $HOME/.parallelcluster folder, and even added the **--norollback** option, but the errors still persist.

_My modified config file:_

```
[aws]
aws_region_name = us-east-1

[cluster hpctest]
key_name = XXXXXXXX
base_os = alinux
master_instance_type = g3s.xlarge
master_root_volume_size = 64
compute_instance_type = c5.4xlarge
compute_root_volume_size = 64
initial_queue_size = 0
max_queue_size = 8
maintain_initial_size = false
custom_ami = ami-0fd18b144da8357b7
scheduler = slurm
cluster_type = spot
placement_group = DYNAMIC
placement = cluster
ebs_settings = shared
fsx_settings = fs
vpc_settings = public

[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500

[fsx fs]
shared_dir = /fsx
storage_capacity = 3600

[global]
cluster_template = hpctest
update_check = true
sanity_check = true

[vpc public]
vpc_id = vpc-XXXXXXXX
master_subnet_id = subnet-XXXXXXXX

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[scailing custom]
scaledown_idletime = 1
```

Note: I take the custom_ami id from <https://github.com/aws/aws-parallelcluster/blob/master/amis.txt>.

_The errors I am facing:_

```
ubuntu@ip-172-XX-XX-XXX:~$ pcluster create t1
Beginning cluster creation for cluster: t1
Creating stack named: parallelcluster-t1
Status: parallelcluster-t1 - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
- AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
- AWS::EC2::PlacementGroup DynamicPlacementGroup Resource creation cancelled
- AWS::EC2::EIP MasterEIP Resource creation cancelled
- AWS::CloudFormation::Stack EBSCfnStack Resource creation cancelled
- AWS::DynamoDB::Table DynamoDBTable Resource creation cancelled
- AWS::IAM::Role RootRole API: iam:CreateRole User: arn:aws:iam::3043XXXXXXXX:user/nutt is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::3043XXXXXXXX:role/parallelcluster-t1-RootRole-1L9A3XXXXXXXX
```

Can anyone help me solve this problem? If there are any previous threads asking the same questions or facing the same problems, please let me know so that I can start learning from those. Thank you for your time! Rangsiman
2
answers
0
votes
0
views
rangsiman
asked 3 years ago

ParallelCluster, AWS Batch, Native libraries not found

I have a large MPI application written in Fortran and I am attempting to get it running on pcluster with the awsbatch scheduler. The pcluster instance has an EFS drive mounted as /tiegcm_efs where pre-built native libraries are stored. The libraries were built on the master node of the cluster, so I was expecting that underlying dependencies, particularly openmpi, would be consistent between the master OS and the docker containers used in the runtime environment. I'm using this page as a model for submitting and starting my MPI job: https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html#running-your-first-job-using-aws-batch

I used submit_mpi.sh as a starting point and have adjusted it for my case and to add a bunch of diagnostic output. Here are some key snippets from my submit_mpi.sh:

```
export LD_LIBRARY_PATH="/tiegcm_efs/dependencies/v20190320/lib:/usr/lib:/usr/lib64"

echo "Libs"
ls -l /tiegcm_efs/dependencies/v20190320/lib
ls -l /usr/lib64/openmpi/lib

echo "MPI"
/usr/lib64/openmpi/bin/mpirun -V

cd /tiegcm_efs/home/kimyx/tiegcm.exec
...
echo "Running main..."
/usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --allow-run-as-root --machinefile "${HOME}/hostfile" ./tiegcm "${TGCMDATA}"
```

The pcluster job is submitted like this (I've done it with and without the -e; for now I'm hardcoding the two environment variables within submit_mpi.sh):

```
awsbsub -c tiegcm -n 2 -p 4 -e LD_LIBRARY_PATH,TGCMDATA -cf submit_mpi.sh
```

The resulting output #0 and #1 both show this:

```
2019-03-20T16:00:24+00:00: ./tiegcm: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory
2019-03-20T16:00:24+00:00: ./tiegcm: error while loading shared libraries: libmpi_usempi.so.20: cannot open shared object file: No such file or directory
```

The libnetcdff.so.6 exists in /tiegcm_efs/dependencies/v20190320/lib but for some reason isn't being loaded. The following is from the ls command within the submit_mpi.sh script:

```
ls -l /tiegcm_efs/dependencies/v20190320/lib
2019-03-20T16:00:04+00:00: lrwxrwxrwx 1 1002 1005      19 Mar 20 03:28 libnetcdff.so -> libnetcdff.so.6.1.1
2019-03-20T16:00:04+00:00: lrwxrwxrwx 1 1002 1005      19 Mar 20 03:28 libnetcdff.so.6 -> libnetcdff.so.6.1.1
2019-03-20T16:00:04+00:00: -rwxr-xr-x 1 1002 1005 1448736 Mar 20 03:28 libnetcdff.so.6.1.1
```

However, libmpi_usempi.so.20 is _not_ found in the expected location /usr/lib64/openmpi/lib, even though all the systems are running Open MPI 2.1.1. The closest matching files within docker are:

```
ls -l /usr/lib64/openmpi/lib
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     35 Mar 18 17:22 libmpi_usempi_ignore_tkr.so -> libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     35 Mar 18 17:22 libmpi_usempi_ignore_tkr.so.20 -> libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: -rwxr-xr-x 1 root root  23216 Aug 29  2018 libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     27 Mar 18 17:22 libmpi_usempif08.so -> libmpi_usempif08.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     27 Mar 18 17:22 libmpi_usempif08.so.20 -> libmpi_usempif08.so.20.10.0
2019-03-20T15:21:51+00:00: -rwxr-xr-x 1 root root 200216 Aug 29  2018 libmpi_usempif08.so.20.10.0
```

whereas on the master node outside of Docker the same directory has this:

```
lrwxrwxrwx 1 root root   24 Jan  7 12:54 libmpi_usempi.so -> libmpi_usempi.so.20.10.0
lrwxrwxrwx 1 root root   24 Jan  7 12:54 libmpi_usempi.so.20 -> libmpi_usempi.so.20.10.0
-rwxr-xr-x 1 root root 7344 Aug 29  2017 libmpi_usempi.so.20.10.0
```

I see the Jan 7 date here; I don't _think_ I installed openmpi myself after creating the cluster. Unlike the sample MPI program, I can't compile my big application when the job starts within Docker. For one thing, **gmake** isn't installed within the docker container. For another, it takes a long time to build all the dependencies. To be clear, this application runs fine (but slowly) when I run it directly on the master EC2 instance using the same files and LD_LIBRARY_PATH, but skipping the mpirun wrapper. Am I missing something about how to specify library search paths within the awsbatch environment? Let me know if you need any more details. Thanks, Kim

Edited by: kimyx on Mar 20, 2019 10:42 AM
6
answers
0
votes
0
views
kimyx
asked 3 years ago

ParallelCluster and AWS Batch

I'm new to using ParallelCluster. I have it set up in an AWS VPC and running test jobs successfully using the traditional pcluster scheduler. Now I'm setting it up with the AWS Batch back-end. I added this to ~/.parallelcluster/config:

```
[cluster awsbatch]
base_os = alinux
scheduler = awsbatch
vpc_settings = public
key_name = swt_kk...
compute_instance_type = c5.xlarge

[vpc public]
master_subnet_id = subnet-049f...
compute_subnet_id = subnet-049f...
vpc_id = vpc-044b...
```

Following the instructions here: https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html

```
source ~/envs/pcluster-virtualenv/bin/activate
pcluster create awsbatch --cluster-template awsbatch
```

the cluster creates OK and I can see the master and compute nodes running in the EC2 console, but pcluster cannot see the compute node:

```
(pcluster-virtualenv) [kk@ip-172-16-0-10 ~]$ pcluster instances awsbatch
MasterServer i-065da40163ecebe4a
(pcluster-virtualenv) [kk@ip-172-16-0-10 ~]$ awsbhosts --cluster awsbatch
ec2InstanceId    instanceType    privateIpAddress    publicIpAddress    runningJobs
---------------  --------------  ------------------  -----------------  -------------
```

and if I start a test job it just sits in the queue. The master and compute subnets are the same and have an internet gateway attached. I see this comment in the tutorial page:

```
# Replace with id of the subnet for the Compute nodes.
# A NAT Gateway is required for MNP.
```

Is a NAT Gateway still required if an Internet Gateway is already in place? Must the compute subnet be different from the master subnet? Any ideas on what might be going wrong? Ways to debug? Thanks, Kim
3
answers
0
votes
0
views
kimyx
asked 3 years ago