verify post-installation scripts run in AWS ParallelCluster


I uploaded a shell script to install conda and a specific environment on my compute nodes after they are configured (OnNodeConfigured) in AWS ParallelCluster. Either there were errors in the installation script or ParallelCluster had issues executing it.

I didn't find any obvious answers in /var/log/cfn-init-cmd.log or /var/log/cfn-init.log. Below is the script that I was attempting to run:

#!/bin/bash

# script for setting up conda environment and installation of openMM on AWS ParallelCluster

# this script is called in the configuration script for pcluster during creation of the virtual cluster

# pull down miniconda and install it
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
/bin/bash Miniconda3-latest-Linux-x86_64.sh -b -p /shared/miniconda3
source ~/.bashrc

# install mamba to speed up package search, download, and installation
# use "-y" flag to allow for silent installation
conda install -c conda-forge mamba -y

# create the openMM environment by referencing the .yaml file in shared directory
mamba env create -f /shared/miniconda/openmm8.yaml -y
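
For reference, here is a sketch of the same steps with shell tracing and fail-fast behaviour enabled, and with output redirected to a file on the shared filesystem (the log path is only an example), so that a failing command and its exit status are easy to find afterwards:

#!/bin/bash
# Same steps as above, instrumented for debugging.
set -ex                                             # print each command, stop on the first error
exec >> "/shared/post-install-$(hostname).log" 2>&1 # example log location on the shared mount

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
/bin/bash Miniconda3-latest-Linux-x86_64.sh -b -p /shared/miniconda3
source ~/.bashrc
conda install -c conda-forge mamba -y
mamba env create -f /shared/miniconda/openmm8.yaml -y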

and below is my configuration file for creating the cluster:

HeadNode:
  InstanceType: c5.2xlarge
  Networking:
    SubnetId: subnet-xxxxxxxx
    AdditionalSecurityGroups:
      - sg-xxxxxxxx
  Ssh:
    KeyName: mertz_key
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: modbind
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: modbind
          Instances:
            - InstanceType: g4dn.metal
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
        PlacementGroup:
          Enabled: true
        AdditionalSecurityGroups:
          - sg-xxxxxxxx
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      CustomActions:
        OnNodeConfigured:
          Script: >-
            s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/modbind-cluster-setup.sh
    - Name: spot
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: spot
          Instances:
            - InstanceType: g4dn.metal
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-xxxxxxx
        PlacementGroup:
          Enabled: true
        AdditionalSecurityGroups:
          - sg-xxxxxxx
      CustomActions:
        OnNodeConfigured:
          Script: >-
            s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/modbind-cluster-setup.sh
  SlurmSettings: {}
Region: us-east-1
Image:
  Os: centos7
DirectoryService:
  DomainName: modulus.ad.com
  DomainAddr: ldaps://modulus.ad.com
  PasswordSecretArn: >-
    arn:aws:secretsmanager:us-east-1:xxxxxxxxxxxx:secret:PasswordSecret-modulus-AD-xxxxxx
  DomainReadOnlyUser: cn=ReadOnlyUser,ou=Users,ou=MODULUS,dc=modulus,dc=ad,dc=com
SharedStorage:
  - Name: Efs0
    StorageType: Efs
    MountDir: /shared
    EfsSettings:
      FileSystemId: fs-xxxxxxxx

blakem
asked a year ago · 470 views
3 Answers

Have you taken a look at /var/log/parallelcluster/clustermgtd on the HeadNode? While protected mode may not be your exact problem, this doc may help direct your debugging efforts.
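
For example, something along these lines on the head node can surface protected-mode or bootstrap-related messages (the search terms are only suggestions):

sudo tail -n 200 /var/log/parallelcluster/clustermgtd
sudo grep -iE 'protect|bootstrap|error' /var/log/parallelcluster/clustermgtd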

AWS
answered a year ago

Chris, thanks for the suggestion. I looked at clustermgtd and there is nothing that I can see that would indicate that the script was read during cluster configuration:

2023-03-30 16:37:12,798 - [slurm_plugin.clustermgtd:main] - INFO - ClusterManager Startup
2023-03-30 16:37:12,798 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:37:12,799 - [slurm_plugin.clustermgtd:set_config] - INFO - Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7f1f391953d0>, region='us-east-1', cluster_name='modbind-AD-modstore1', dynamodb_table='parallelcluster-slurm-modbind-AD-modstore1', head_node_private_ip='10.0.0.195', head_node_hostname='ip-10-0-0-195.ec2.internal', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7f1f33df3160>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, health_check_timeout_after_slurmdstarttime=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, fleet_config={'modbind': {'modbind': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'g4dn.metal'}], 'Networking': {'SubnetIds': ['subnet-659e6f4a']}}}, 'spot': {'spot': {'Api': 'create-fleet', 'CapacityType': 'spot', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'g4dn.metal'}], 'Networking': {'SubnetIds': ['subnet-659e6f4a']}}}}, run_instances_overrides={}, create_fleet_overrides={}, terminate_max_batch_size=1000, node_replacement_timeout=1800, terminate_drain_nodes=True, terminate_down_nodes=True, orphaned_instance_timeout=120, protected_failure_count=10, insufficient_capacity_timeout=600.0, disable_nodes_on_insufficient_capacity=True, hosted_zone='Z06359702HT9JMZIE8TK6', dns_domain='modbind-ad-modstore1.pcluster.', use_private_hostname=False, compute_console_logging_enabled=True, compute_console_logging_max_sample_size=1, compute_console_wait_time=300, worker_pool_size=5, worker_pool_max_backlog=100)
2023-03-30 16:37:14,333 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: modbind-AD-modstore1-RoleHeadNode-1I8XZ48CF2G09
2023-03-30 16:37:14,904 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:37:14,989 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-03-30 16:37:15,277 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING

......

2023-03-30 16:47:15,331 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:47:15,333 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-03-30 16:47:15,619 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-03-30 16:47:15,619 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-03-30 16:47:20,720 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-03-30 16:47:20,798 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

Any other ideas?

blakem
answered a year ago

Another way to narrow down whether the issue is with the install script:

Launch the cluster without the post-install script for now. Once it is up, SSH into the head node and run the script's commands manually as the root user. If the commands run successfully and the exit code is 0, you can be sure there is no issue with the commands themselves; a sketch of this check is shown below.
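
For example, something like this (a sketch that reuses the S3 URI from your cluster configuration; adjust paths as needed):

# run as root on the head node
aws s3 cp s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/modbind-cluster-setup.sh /tmp/modbind-cluster-setup.sh
bash /tmp/modbind-cluster-setup.sh
echo "exit code: $?"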

Since there is nothing in the cfn-init logs that points to the problem, there could also be a resource configuration issue.

I would suggest raising a support case with the resource details so this can be troubleshot in the right direction.

AWS
SUPPORT ENGINEER
answered a year ago
