By using AWS re:Post, you agree to the Terms of Use
All Questions
Sort by most recent

Browse through the questions and answers listed below or filter and sort to narrow down your results.

Ubuntu Managed Nodes creation failure in Fully-Private cluster

Hi, For some reason I am not able to create Ubuntu managed nodes in fully private cluster. Though, managed Amazon-Linux nodes and all other self-managed nodes are joining the cluster successfully. I have followed all the guides and troubleshooting aws websites already still I am not successful. I have also run the troubleshooting script. Below is the result ``` HERE IS A SUMMARY OF THE ITEMS THAT REQUIRE YOUR ATTENTION: [WARNING]: Worker node's AMI ami-0ebb49de26355a371 differs from the public EKS Optimized AMIs. Ensure that the Kubelet daemon is at the same version as your cluster's version 1.23 or only one minor version behind. Please review this URL for further details: https://kubernetes.io/releases/version-skew-policy/ . [WARNING]: No secondary private IP addresses are assigned to worker node i-0e775ca75fe57bb70, ensure that the CNI plugin is running properly. Please review this URL for further details: https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html [WARNING]: As SSM agent is not reachable on worker node, this document did not check the status of Containerd, Docker and Kubelet daemons. Ensure that required daemons (containerd, docker, kubelet) are running on the worker node using command "systemctl status <daemon-name>". ============================================================================================================================================ Here are the detailed steps of the document execution: [X] Checking EKS cluster test-cluster: EKS Cluster: test-cluster is in Active state. 1. Checking if the cluster Security Group is allowing traffic from the worker node: Passed: The cluster Security Group sg-068d83e5a68aff9c7 is allowing traffic from the worker node. 2. Checking DHCP options of the cluster VPC: Passed: AmazonProvidedDNS is enabled 3. Checking cluster IAM role arn:aws:iam::782534010321:role/eksctl-test-cluster-cluster-ServiceRole-9WPNFE2Q3N2T for the required permissions: Passed: IAM role for cluster test-cluster has the required IAM policies attached. Passed: The cluster IAM role arn:aws:iam::782534010321:role/eksctl-test-cluster-cluster-ServiceRole-9WPNFE2Q3N2T has the required trust relationship for the EKS service. 4. Checking control plane Elastic Network Interfaces(ENIs) in the cluster VPC: Passed: The cluster Elastic Network Interfaces(ENIs) exist. 5. Cluster Endpoint Private access is disabled for your cluster, checking if the Public CIDR ranges include worker node i-0e775ca75fe57bb70 outbound IP: Passed: The cluster allows public access from 0.0.0.0/0 6. Checking cluster VPC for required DNS attributes: Passed: Cluster VPC vpc-0cbb4879c588fb52a has the required DNS attributes correctly set. -------------------------------------------------------------------------------------------------------------------------------------------- [X] Checking worker node i-0e775ca75fe57bb70 state: The instance is Running. 1. Checking if the EC2 instance family is supported: Passed: EC2 instance family m5.xlarge is supported. 2. Checking the worker node network configuration: Passed: Worker node is created in a private subnet without a NAT Gateway so VPC endpoints need to be used. Passed: Checking VPC Endpoints setup: Passed: The VPC Endpoint com.amazonaws.eu-west-2.ec2 exists. Checking its configuration: Passed: Security groups [{'GroupId': 'sg-01294c96494e79aaa', 'GroupName': 'eksctl-test-cluster-cluster-ClusterSharedNodeSecurityGroup-1TIY3P78RYQOZ'}] applied to VPC Endpoint com.amazonaws.eu-west-2.ec2 is allowing the worker node to reach the endpoint. Passed: The default VPC Endpoint Policy is being used. Passed: The VPC Endpoint com.amazonaws.eu-west-2.ecr.api exists. Checking its configuration: Passed: Security groups [{'GroupId': 'sg-01294c96494e79aaa', 'GroupName': 'eksctl-test-cluster-cluster-ClusterSharedNodeSecurityGroup-1TIY3P78RYQOZ'}] applied to VPC Endpoint com.amazonaws.eu-west-2.ecr.api is allowing the worker node to reach the endpoint. Passed: The default VPC Endpoint Policy is being used. Passed: The VPC Endpoint com.amazonaws.eu-west-2.ecr.dkr exists. Checking its configuration: Passed: Security groups [{'GroupId': 'sg-01294c96494e79aaa', 'GroupName': 'eksctl-test-cluster-cluster-ClusterSharedNodeSecurityGroup-1TIY3P78RYQOZ'}] applied to VPC Endpoint com.amazonaws.eu-west-2.ecr.dkr is allowing the worker node to reach the endpoint. Passed: The default VPC Endpoint Policy is being used. Passed: The VPC Endpoint com.amazonaws.eu-west-2.sts exists. Checking its configuration: Passed: Security groups [{'GroupId': 'sg-01294c96494e79aaa', 'GroupName': 'eksctl-test-cluster-cluster-ClusterSharedNodeSecurityGroup-1TIY3P78RYQOZ'}] applied to VPC Endpoint com.amazonaws.eu-west-2.sts is allowing the worker node to reach the endpoint. Passed: The default VPC Endpoint Policy is being used. Passed: S3 gateway endpoint ['vpce-0e01febb999cf1735', 'vpce-08ef7ffac589c3d01'] is added to the worker's VPC. Passed: Worker node's route table rtb-0d26fefe5a613e64f has the required route for the S3 endpoint vpce-0e01febb999cf1735. Passed: Worker node's route table rtb-0d26fefe5a613e64f has the required route for the S3 endpoint vpce-08ef7ffac589c3d01. 3. Checking the IAM instance Profile of the worker node: Passed: The instance profile arn:aws:iam::782534010321:instance-profile/eks-d0c1cede-94e5-c6bc-75fd-2f86043a90eb is used with the worker node i-0e775ca75fe57bb70 . Passed: IAM role arn:aws:iam::782534010321:role/eksctl-test-cluster-nodegroup-ser-NodeInstanceRole-12LZWOL4EJAJX is attached to Instance Profile. Passed: IAM role arn:aws:iam::782534010321:role/eksctl-test-cluster-nodegroup-ser-NodeInstanceRole-12LZWOL4EJAJX has the required IAM policies attached. Passed: No issues detected with the trust relationship policy of the arn:aws:iam::782534010321:role/eksctl-test-cluster-nodegroup-ser-NodeInstanceRole-12LZWOL4EJAJX role. 4. Checking worker node's UserData bootstrap script: Passed: The UserData of the worker node contains the required bootstrap script. 5. Checking the worker node i-0e775ca75fe57bb70 tags: Passed: Worker node i-0e775ca75fe57bb70 has the required cluster tags. 6. Checking the AMI version for EC2 instance i-0e775ca75fe57bb70: [WARNING]: Worker node's AMI ami-0ebb49de26355a371 differs from the public EKS Optimized AMIs. Ensure that the Kubelet daemon is at the same version as your cluster's version 1.23 or only one minor version behind. Please review this URL for further details: https://kubernetes.io/releases/version-skew-policy/ . 7. Checking worker node i-0e775ca75fe57bb70 Elastic Network Interfaces(ENIs) and Private IP addresses to check if CNI is running: [WARNING]: No secondary private IP addresses are assigned to worker node i-0e775ca75fe57bb70, ensure that the CNI plugin is running properly. Please review this URL for further details: https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html 8. Checking the outbound SG rules for worker node i-0e775ca75fe57bb70 Passed: The Outbound security group rules for worker node i-0e775ca75fe57bb70 are sufficient to allow traffic to the EKS cluster endpoint 9. Checking if the worker node is running in AWS Outposts subnet Passed: Worker node's subnet subnet-092b2d53a74ac655f is not running in AWS Outposts 10. Checking basic NACL rules Passed: NACL acl-0507434ac9745b3dc has sufficient rules to allow cluster traffic. 11. Checking STS regional endpoint availability: Passed: STS endpoint is activated within region eu-west-2. 12. Checking if Instance Metadata http endpoint is enabled on the worker node: Passed: Instance metadata endpoint is enabled on the worker node. 13. Checking if SSM agent is running and reachable on worker node: [WARNING]: As SSM agent is not reachable on worker node, this document did not check the status of Containerd, Docker and Kubelet daemons. Ensure that required daemons (containerd, docker, kubelet) are running on the worker node using command "systemctl status <daemon-name>". ============================================================================================================================================ Here is a list of other possible causes that were NOT checked by this document: [-] Ensure that Instance IAM role is added to aws-auth configmap, please check: https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html. [-] If your account is a part of AWS Organizations Service, confirm that no Service Control Policy (SCP) is denying required permissions, please check: https://aws.amazon.com/premiumsupport/knowledge-center/eks-node-status-ready/. ``` One strange thing is that in point 5 it says private endpoint access is disabled but I have already created the cluster fully private. ``` { "update": { "id": "4c429db1-80bb-48d2-bf98-3f84524c0b83", "status": "Successful", "type": "EndpointAccessUpdate", "params": [ { "type": "EndpointPublicAccess", "value": "false" }, { "type": "EndpointPrivateAccess", "value": "true" }, { "type": "PublicAccessCidrs", "value": "[\"0.0.0.0/0\"]" } ], "createdAt": "2022-10-03T09:35:42.092000+00:00", "errors": [] } } ``` Also, when I run the command `sudo systemctl status kubelet` I get the result. `Unit kubelet.service could not be found.` My cluster config is as below ``` apiVersion: eksctl.io/v1alpha5 kind: ClusterConfig metadata: name: test-cluster region: eu-west-2 version: "1.23" privateCluster: enabled: true vpc: id: vpc-ID subnets: private: hscn-private1-subnet: id: subnet-1 hscn-private2-subnet: id: subnet-2 managedNodeGroups: - name: serv-test-1 ami: ami-0ebb49de26355a371 instanceType: m5.xlarge desiredCapacity: 1 volumeType: gp2 volumeSize: 50 privateNetworking: true disableIMDSv1: true subnets: - hscn-private2-subnet ssh: allow: true tags: kubernetes.io/cluster/test-cluster: owned overrideBootstrapCommand: | #!/bin/bash /etc/eks/bootstrap.sh test-cluster --kubelet-extra-args '--node-labels=eks.amazonaws.com/nodegroup=serv-test-1,eks.amazonaws.com/nodegroup-image=ami-0ebb49de26355a371' --dns-cluster-ip 10.100.0.10 --apiserver-endpoint {My endpoint} --b64-cluster-ca {My-CA} ```
0
answers
0
votes
12
views
asked 13 hours ago

AWS Pytorch Neuron Compliation Error

I followed user guide on updating torch neuron and then started compiling the model to neuron. But got an error, from which I don't understand what's wrong. In Neuron SDK you claim that it should compile all operations, even not supported ones, they just should run on CPU. The error: ``` INFO:Neuron:All operators are compiled by neuron-cc (this does not guarantee that neuron-cc will successfully compile) INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 3345, fused = 3345, percent fused = 100.0% INFO:Neuron:Number of neuron graph operations 8175 did not match traced graph 9652 - using heuristic matching of hierarchical information INFO:Neuron:Compiling function _NeuronGraph$3362 with neuron-cc INFO:Neuron:Compiling with command line: '/home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config {"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]} --verbose 35' ..............................................................................INFO:Neuron:Compile command returned: -9 WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$3362; falling back to native python function call ERROR:Neuron:neuron-cc failed with the following command line call: /home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]}' --verbose 35 Traceback (most recent call last): File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 382, in op_converter item, inputs, compiler_workdir=sg_workdir, **kwargs) File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/decorators.py", line 220, in trace 'neuron-cc failed with the following command line call:\n{}'.format(command)) subprocess.SubprocessError: neuron-cc failed with the following command line call: /home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]}' --verbose 35 INFO:Neuron:Number of arithmetic operators (post-compilation) before = 3345, compiled = 0, percent compiled = 0.0% INFO:Neuron:The neuron partitioner created 1 sub-graphs INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0% INFO:Neuron:Compiled these operators (and operator counts) to Neuron: INFO:Neuron:Not compiled operators (and operator counts) to Neuron: INFO:Neuron: => aten::Int: 942 [supported] INFO:Neuron: => aten::_convolution: 107 [supported] INFO:Neuron: => aten::add: 104 [supported] INFO:Neuron: => aten::batch_norm: 1 [supported] INFO:Neuron: => aten::cat: 1 [supported] INFO:Neuron: => aten::contiguous: 4 [supported] INFO:Neuron: => aten::div: 104 [supported] INFO:Neuron: => aten::dropout: 208 [supported] INFO:Neuron: => aten::feature_dropout: 1 [supported] INFO:Neuron: => aten::flatten: 60 [supported] INFO:Neuron: => aten::gelu: 52 [supported] INFO:Neuron: => aten::layer_norm: 161 [supported] INFO:Neuron: => aten::linear: 264 [supported] INFO:Neuron: => aten::matmul: 104 [supported] INFO:Neuron: => aten::mul: 52 [supported] INFO:Neuron: => aten::permute: 210 [supported] INFO:Neuron: => aten::relu: 1 [supported] INFO:Neuron: => aten::reshape: 262 [supported] INFO:Neuron: => aten::select: 104 [supported] INFO:Neuron: => aten::sigmoid: 1 [supported] INFO:Neuron: => aten::size: 278 [supported] INFO:Neuron: => aten::softmax: 52 [supported] INFO:Neuron: => aten::transpose: 216 [supported] INFO:Neuron: => aten::upsample_bilinear2d: 4 [supported] INFO:Neuron: => aten::view: 52 [supported] Traceback (most recent call last): File "to_neuron.py", line 14, in <module> model_neuron = torch.neuron.trace(model, example_inputs=[image.cuda()]) File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 184, in trace cu.stats_post_compiler(neuron_graph) File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 493, in stats_post_compiler "No operations were successfully partitioned and compiled to neuron for this model - aborting trace!") RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace! ```
0
answers
0
votes
2
views
asked 14 hours ago