AWS EKS - EIA attached to node not reachable by Pod


I'm using a standard, fully cloud-based AWS EKS cluster (K8s 1.22) with multiple node groups, one of which uses a Launch Template that attaches an Elastic Inference Accelerator (eia2.medium) to its instances in order to serve a TensorFlow model.

I've been struggling a lot to get our deep learning model working at all once deployed. Concretely, I have a Pod in a Deployment, with a Service Account and an EKS IRSA policy attached, based on the AWS Deep Learning Container for inference serving with TensorFlow 1.15.0.

The image used is 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu, and when the model is deployed in the cluster, with a node affinity pinning it to the proper EIA-enabled node, it simply fails when called on the /invocations endpoint:

Using Amazon Elastic Inference Client Library Version: 1.6.3
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-<id>
Elastic Inference Accelerator Type: eia2.medium
Elastic Inference Accelerator Ordinal: 0

2022-05-11 13:47:17.799145: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1221] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator <redacted>list from server.
WARNING:__main__:unexpected tensorflow serving exit (status: 134). restarting.
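
For completeness, the Pod is pinned to the EIA node group with a standard nodeAffinity; a minimal sketch of the Deployment follows (the label key/value and names are placeholders, not the actual ones used in the cluster):

```yaml
# Sketch of the Deployment described above; label key/value and names
# are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-inference-eia
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tf-inference-eia
  template:
    metadata:
      labels:
        app: tf-inference-eia
    spec:
      serviceAccountName: tf-inference-eia   # ServiceAccount carries the IRSA annotation
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload/eia        # placeholder node label
                    operator: In
                    values: ["enabled"]
      containers:
        - name: serving
          image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu
          ports:
            - containerPort: 8500   # TF Serving gRPC
            - containerPort: 8501   # TF Serving REST
```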

Just as a reference, when using the CPU-only image available at 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.15.0-cpu, the model serves perfectly in any environment (locally too), of course with a much longer computation time. On top of this, if I deploy a single EC2 instance with an EIA attached and serve the container with a simple Docker command, the EIA works fine and is reached correctly by the container.
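
The standalone EC2 test was essentially the following (model path and flags are illustrative, not verbatim):

```shell
# On a plain EC2 instance (same VPC and subnets) with an eia2.medium
# attached, the very same image reaches the accelerator without issues.
# The mounted model path is illustrative.
docker run --rm \
  -p 8500:8500 -p 8501:8501 \
  -v /home/ec2-user/model:/opt/ml/model \
  763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu
```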

Each EKS node and the Pod itself (via IRSA) has the following policy attached:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elastic-inference:Connect",
                "iam:List*",
                "iam:Get*",
                "ec2:Describe*",
                "ec2:Get*",
                "ec2:ModifyInstanceAttribute"
            ],
            "Resource": "*"
        }
    ]
}

as per the documentation from AWS itself. I have also created a VPC Endpoint for Elastic Inference, as described by AWS, and bound it to the private subnets used by the EKS nodes, along with a properly configured Security Group that allows SSH, HTTPS, and TCP ports 8500/8501 from the VPC CIDR (i.e., from any worker node).
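
The endpoint and Security Group setup can be double-checked from the CLI; a sketch of the checks I ran (the Security Group ID is a placeholder):

```shell
# Verify the Elastic Inference VPC endpoint is 'available' and attached
# to the node subnets with the expected Security Groups.
aws ec2 describe-vpc-endpoints \
  --filters Name=service-name,Values=com.amazonaws.eu-west-1.elastic-inference.runtime \
  --query 'VpcEndpoints[].{State:State,Subnets:SubnetIds,SGs:Groups[].GroupId}'

# Inspect the ingress rules of the endpoint's Security Group
# (placeholder ID): HTTPS must be open to the worker nodes' CIDR.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions'
```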

Both the AWS Reachability Analyzer and the IAM Policy Simulator report nothing wrong: networking and permissions seem fine. The EISetupValidator.py script provided by AWS says the same.

Any clue on what's actually happening here? Am I missing some kind of permissions or networking setup?
