Have you taken a look at /var/log/parallelcluster/clustermgtd on the HeadNode? While protected mode may not be your exact problem, this doc may help direct your debugging efforts.
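A quick way to inspect that log on the HeadNode might look like the sketch below (the log path is the standard ParallelCluster location; the grep pattern is just a suggestion):

```shell
# Run on the HeadNode. Standard ParallelCluster log path; adjust if your install differs.
LOG=/var/log/parallelcluster/clustermgtd
# Show the most recent management-loop activity
tail -n 50 "$LOG" 2>/dev/null || echo "log not found; run this on the HeadNode"
# Scan for protected-mode messages, warnings, and errors
grep -iE 'protected|warn|error' "$LOG" 2>/dev/null | tail -n 20 || true
```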
Chris, thanks for the suggestion. I looked at clustermgtd, and nothing in it indicates that the script was run during cluster configuration:
2023-03-30 16:37:12,798 - [slurm_plugin.clustermgtd:main] - INFO - ClusterManager Startup
2023-03-30 16:37:12,798 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:37:12,799 - [slurm_plugin.clustermgtd:set_config] - INFO - Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7f1f391953d0>, region='us-east-1', cluster_name='modbind-AD-modstore1', dynamodb_table='parallelcluster-slurm-modbind-AD-modstore1', head_node_private_ip='10.0.0.195', head_node_hostname='ip-10-0-0-195.ec2.internal', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7f1f33df3160>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, health_check_timeout_after_slurmdstarttime=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, fleet_config={'modbind': {'modbind': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'g4dn.metal'}], 'Networking': {'SubnetIds': ['subnet-659e6f4a']}}}, 'spot': {'spot': {'Api': 'create-fleet', 'CapacityType': 'spot', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'g4dn.metal'}], 'Networking': {'SubnetIds': ['subnet-659e6f4a']}}}}, run_instances_overrides={}, create_fleet_overrides={}, terminate_max_batch_size=1000, node_replacement_timeout=1800, terminate_drain_nodes=True, terminate_down_nodes=True, orphaned_instance_timeout=120, protected_failure_count=10, insufficient_capacity_timeout=600.0, disable_nodes_on_insufficient_capacity=True, hosted_zone='Z06359702HT9JMZIE8TK6', dns_domain='modbind-ad-modstore1.pcluster.', use_private_hostname=False, compute_console_logging_enabled=True, compute_console_logging_max_sample_size=1, compute_console_wait_time=300, worker_pool_size=5, worker_pool_max_backlog=100)
2023-03-30 16:37:14,333 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: modbind-AD-modstore1-RoleHeadNode-1I8XZ48CF2G09
2023-03-30 16:37:14,904 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:37:14,989 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-03-30 16:37:15,277 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
......
2023-03-30 16:47:15,331 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2023-03-30 16:47:15,333 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2023-03-30 16:47:15,619 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Current compute fleet status: RUNNING
2023-03-30 16:47:15,619 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2023-03-30 16:47:20,720 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2023-03-30 16:47:20,797 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-03-30 16:47:20,798 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
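Worth noting: clustermgtd only logs compute-fleet management, so a custom-action (post-install) script would normally leave its traces in the cfn-init and cloud-init logs instead. A quick check there might look like this (the paths below are the standard ParallelCluster bootstrap log locations; the grep pattern is a suggestion):

```shell
# Run on the HeadNode; standard ParallelCluster bootstrap log locations.
logs="/var/log/cfn-init.log /var/log/cloud-init.log /var/log/cloud-init-output.log"
for f in $logs; do
  echo "== $f =="
  # Look for mentions of the custom-action script and any errors around it
  grep -iE 'OnNodeConfigured|post.?install|error' "$f" 2>/dev/null | tail -n 10 || true
done
```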
Any other ideas?
Another way to narrow down an install-script issue would be:
Launch the ParallelCluster without the post-install script for a while. Once it is up, SSH into the head node and run the script's commands manually as the root user. If the commands run successfully with exit code 0, you can be sure there is no issue with the commands themselves.
Since there is nothing in the cfn-init logs that points to the issue, there could also be a resource configuration problem.
I would request you to raise a support case with the resource details so this can be troubleshot in the right direction.
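The manual check described above could be sketched like this (SCRIPT is a hypothetical placeholder; substitute the actual path or URL of your OnNodeConfigured / post-install script):

```shell
# Run on the HeadNode as root. SCRIPT is a hypothetical path; substitute your
# actual post-install script.
SCRIPT=${SCRIPT:-/tmp/post_install.sh}
rc=0
bash "$SCRIPT" || rc=$?   # capture the exit code without aborting on failure
if [ "$rc" -eq 0 ]; then
  echo "script succeeded (exit code 0): the commands themselves are fine"
else
  echo "script failed with exit code $rc: debug the script before the cluster config"
fi
```

An exit code of 0 shifts the investigation away from the script and toward cluster-side causes such as IAM permissions, networking, or other resource configuration.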