
Questions tagged with Amazon EC2



AWS EKS - EIA attached to node not reachable by Pod

I'm using a standard **AWS EKS** cluster, all cloud based (K8s 1.22), with multiple node groups, one of which uses a Launch Template that defines an Elastic Inference Accelerator (eia2.medium) attached to the instances to serve a TensorFlow model. I've been struggling to get our deep learning model working at all when deployed. I have a Pod in a Deployment, with a Service Account and an **EKS IRSA** policy attached, based on the AWS Deep Learning Container for inference serving with TensorFlow 1.15.0. The image used is `763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu`, and when the model is deployed in the cluster, with a node affinity to the proper EIA-enabled node, it simply doesn't work when called via the /invocations endpoint:

```
Using Amazon Elastic Inference Client Library Version: 1.6.3
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-<id>
Elastic Inference Accelerator Type: eia2.medium
Elastic Inference Accelerator Ordinal: 0
2022-05-11 13:47:17.799145: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1221] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator <redacted>list from server.
WARNING:__main__:unexpected tensorflow serving exit (status: 134). restarting.
```

For reference, when using the CPU-only image available at `763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.15.0-cpu`, the model serves perfectly in any environment (locally too), of course with much longer computation times. Also, if I deploy a single EC2 instance with the EIA attached and serve the container with a plain Docker command, the EIA works fine and is accessed correctly by the container.

Each EKS node and the Pod itself (via IRSA) has the following policy attached, as per the AWS documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elastic-inference:Connect",
        "iam:List*",
        "iam:Get*",
        "ec2:Describe*",
        "ec2:Get*",
        "ec2:ModifyInstanceAttribute"
      ],
      "Resource": "*"
    }
  ]
}
```

I have also created a **VPC Endpoint for Elastic Inference** as described by AWS and bound it to the private subnets used by the EKS nodes, along with a properly configured **Security Group** that allows **SSH**, **HTTPS** and **8500/8501 TCP** from any worker node in the VPC CIDR. Both the **AWS Reachability Analyzer** and the **IAM Policy Simulator** find nothing wrong, the networking and permissions look fine, and the *EISetupValidator.py* script provided by AWS says the same.

Any clue on what's actually happening here? Am I missing some permission or networking setup?
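A quick way to narrow this down might be to check, from inside the serving Pod itself, whether the Elastic Inference API hostname resolves to the VPC endpoint's private IPs and is reachable on 443. This is a minimal diagnostic sketch, assuming the eu-west-1 endpoint name `api.elastic-inference.eu-west-1.amazonaws.com` (adjust if your setup uses a different one), run via `kubectl exec` in the Pod:

```python
import socket

# Assumed endpoint name for eu-west-1; the EIA client talks to the Elastic
# Inference runtime API over HTTPS (443).
HOST = "api.elastic-inference.eu-west-1.amazonaws.com"

# 1. DNS: with the VPC endpoint associated to these subnets and private DNS
#    enabled, this should resolve to private IPs inside the VPC CIDR.
infos = socket.getaddrinfo(HOST, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print("resolved:", addresses)

# 2. TCP reachability: the security group on the VPC endpoint must allow 443
#    from the worker-node subnets for this connect to succeed.
for ip in addresses:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect((ip, 443))
        print(ip, "reachable on 443")
    except OSError as exc:
        print(ip, "NOT reachable:", exc)
    finally:
        sock.close()
```

If the name resolves to public IPs, the Pod is bypassing the VPC endpoint (private DNS not enabled, or the endpoint isn't associated with those subnets); if it resolves privately but the connect times out, the endpoint's security group is the first thing to re-check.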
0 answers · 0 votes · 8 views · asked 2 days ago

We have 2 volumes we can't detach or delete

We have two volumes (vol-63046619 and vol-2c076556) that need to be deleted, but we can't delete them. They are not attached to any EC2 instance. Below are the commands we tried:

```
$ aws ec2 describe-volumes --region us-east-1 --volume-id vol-63046619
{
    "Volumes": [
        {
            "AvailabilityZone": "us-east-1d",
            "Attachments": [],
            "Tags": [
                {
                    "Value": "",
                    "Key": "Name"
                }
            ],
            "Encrypted": false,
            "VolumeType": "standard",
            "VolumeId": "vol-63046619",
            "State": "in-use",
            "SnapshotId": "snap-xxxxxxx",
            "CreateTime": "2012-10-01T20:29:01.000Z",
            "MultiAttachEnabled": false,
            "Size": 8
        }
    ]
}

$ aws ec2 delete-volume --region us-east-1 --volume-id vol-63046619
An error occurred (IncorrectState) when calling the DeleteVolume operation: The volume 'vol-63046619' is 'in-use'

$ aws ec2 detach-volume --region us-east-1 --volume-id vol-63046619
An error occurred (IncorrectState) when calling the DetachVolume operation: Volume 'vol-63046619' is in the 'available' state.

$ aws ec2 describe-volumes --region us-east-1 --volume-id vol-2c076556
{
    "Volumes": [
        {
            "AvailabilityZone": "us-east-1d",
            "Attachments": [],
            "Tags": [
                {
                    "Value": "xxxxxxxxxxxxx",
                    "Key": "aws:cloudformation:stack-name"
                },
                {
                    "Value": "",
                    "Key": "Name"
                },
                {
                    "Value": "xxxxxxxx",
                    "Key": "aws:cloudformation:logical-id"
                },
                {
                    "Value": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxxxx:stack/xxxxxxxxxx/xxxxxx-xxxx-xxxx-xxxx-xxxx",
                    "Key": "aws:cloudformation:stack-id"
                }
            ],
            "Encrypted": false,
            "VolumeType": "standard",
            "VolumeId": "vol-2c076556",
            "State": "in-use",
            "SnapshotId": "",
            "CreateTime": "2012-10-01T20:28:41.000Z",
            "MultiAttachEnabled": false,
            "Size": 5
        }
    ]
}

$ aws ec2 delete-volume --region us-east-1 --volume-id vol-2c076556
An error occurred (IncorrectState) when calling the DeleteVolume operation: The volume 'vol-2c076556' is 'in-use'

$ aws ec2 detach-volume --region us-east-1 --volume-id vol-2c076556
An error occurred (IncorrectState) when calling the DetachVolume operation: Volume 'vol-2c076556' is in the 'available' state.
```

We also tried detach and force detach from the console, but it just gets stuck and doesn't help.
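For what it's worth, a scripted force-detach followed by a delete sometimes behaves differently from the console; below is a minimal boto3 sketch of that attempt (volume IDs taken from the question, region us-east-1). If a volume keeps reporting 'in-use' with no attachments while DetachVolume sees it as 'available', that inconsistent state usually has to be cleaned up by AWS Support.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for volume_id in ("vol-63046619", "vol-2c076556"):
    vol = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    print(volume_id, "state:", vol["State"], "attachments:", vol["Attachments"])

    # Force detach first; with an empty Attachments list this may still raise
    # IncorrectState, which is exactly the inconsistency shown above.
    try:
        ec2.detach_volume(VolumeId=volume_id, Force=True)
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    except Exception as exc:
        print("  detach failed:", exc)

    try:
        ec2.delete_volume(VolumeId=volume_id)
        print("  deleted")
    except Exception as exc:
        print("  delete failed:", exc)
```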
2 answers · 0 votes · 33 views · asked 6 days ago

Run fleet with on-demand instances across AZs

Hello, I wanted to start an EC2 Fleet with on-demand instances only, distributed across availability zones. Unfortunately, I couldn't find a way to do that, and all the instances are always started in a single AZ. That is not a problem with spot instances, as they spawn in all the AZs. I tried different allocation strategies and priorities, but nothing helped. I was doing this in AWS CDK, using both `CfnEC2Fleet` [link](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ec2.CfnEC2Fleet.html) and `CfnSpotFleet` [link](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ec2.CfnSpotFleet.html). Below is my code. Is there a way to achieve this, or do I need to use something else? Thank you.

```typescript
const spotFleet = new CfnSpotFleet(stack, 'EC2-Fleet', {
  spotFleetRequestConfigData: {
    allocationStrategy: 'lowestPrice',
    targetCapacity: 8,
    iamFleetRole: fleetRole.roleArn,
    spotMaintenanceStrategies: {
      capacityRebalance: {
        replacementStrategy: 'launch-before-terminate',
        terminationDelay: 120,
      },
    },
    onDemandTargetCapacity: 4,
    instancePoolsToUseCount: stack.availabilityZones.length,
    launchTemplateConfigs: [{
      launchTemplateSpecification: {
        launchTemplateId: launchTemplate.launchTemplateId,
        version: launchTemplate.latestVersionNumber,
      },
      overrides: privateSubnets.map(subnet => ({
        availabilityZone: subnet.subnetAvailabilityZone,
        subnetId: subnet.subnetId,
      })),
    }],
  },
});

const ec2Fleet = new CfnEC2Fleet(stack, 'EC2-EcFleet', {
  targetCapacitySpecification: {
    totalTargetCapacity: 6,
    onDemandTargetCapacity: 6,
    defaultTargetCapacityType: 'on-demand',
  },
  replaceUnhealthyInstances: true,
  onDemandOptions: {
    allocationStrategy: 'prioritized',
  },
  launchTemplateConfigs: [{
    launchTemplateSpecification: {
      launchTemplateId: launchTemplate.launchTemplateId,
      version: launchTemplate.latestVersionNumber,
    },
    overrides: privateSubnets.map(subnet => ({
      availabilityZone: subnet.subnetAvailabilityZone,
      subnetId: subnet.subnetId,
    })),
  }],
});
```

Where `launchTemplate` is an instance of [`LaunchTemplate`](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ec2.LaunchTemplate.html) and `privateSubnets` is an array of [`Subnet`](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ec2.Subnet.html) instances, one for each AZ.
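One pattern sometimes used to force an even on-demand spread is to request one small fleet per AZ instead of a single fleet with several overrides, so the allocation strategy has no chance to collapse everything into one pool. A rough sketch of that idea (in Python/boto3 rather than CDK, for consistency with the other examples on this page); the launch template ID and subnet IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical inputs: one private subnet per AZ and an existing launch template.
subnets_by_az = {
    "eu-west-1a": "subnet-aaaaaaaa",
    "eu-west-1b": "subnet-bbbbbbbb",
    "eu-west-1c": "subnet-cccccccc",
}
launch_template_id = "lt-0123456789abcdef0"
capacity_per_az = 2  # 3 AZs x 2 = 6 on-demand instances in total

for az, subnet_id in subnets_by_az.items():
    ec2.create_fleet(
        Type="maintain",
        TargetCapacitySpecification={
            "TotalTargetCapacity": capacity_per_az,
            "OnDemandTargetCapacity": capacity_per_az,
            "DefaultTargetCapacityType": "on-demand",
        },
        LaunchTemplateConfigs=[{
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            # A single override pins this fleet to one AZ/subnet.
            "Overrides": [{"SubnetId": subnet_id, "AvailabilityZone": az}],
        }],
    )
```

The same split can be expressed in CDK by looping over `privateSubnets` and creating one `CfnEC2Fleet` per subnet.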
0 answers · 0 votes · 9 views · asked 7 days ago

XGBoost Error: Allreduce failed - 100 GB Dask DataFrame on AWS Fargate ECS cluster with 1 TB of memory dies.

Overview: I'm trying to run an XGBoost model on a bunch of parquet files sitting in S3, using Dask, by setting up a Fargate cluster and connecting it to a Dask cluster. The dataframe totals about 140 GB of data. I scaled up a Fargate cluster with these properties:

* Workers: 40
* Total threads: 160
* Total memory: 1 TB

So there should be enough memory to hold the data. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then I create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail. Next I run `xgb.dask.train`, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly, the workers die and I get the error `XGBoostError: rabit/internal/utils.h:90: Allreduce failed`. When I attempted this with a single file of only 17 MB of data, I would still get this error, but only a couple of workers died. Does anyone know why this happens, since I have double the memory of the dataframe?

```python
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```
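Rabit Allreduce failures tend to surface when a worker has already died (often out-of-memory while the DMatrix is being built) or when the workers cannot reach each other on the tracker port, which on Fargate means the task security group must allow worker-to-worker traffic. A small sketch of checks that are sometimes run before training, assuming the same `client`, `X_train` and `y_train` objects as above:

```python
def rss_gb():
    # psutil ships with distributed, so it is available on the workers.
    import psutil
    return psutil.Process().memory_info().rss / 1e9

# Materialise the preprocessed data on the workers first, so a memory blow-up
# shows up here (and in the dashboard) rather than inside Rabit during training.
X_train = X_train.persist()
y_train = y_train.persist()

# Per-worker resident memory: one worker far above the others is the usual
# precursor to a dead worker and the "Allreduce failed" crash.
for addr, gb in client.run(rss_gb).items():
    print(addr, f"{gb:.1f} GB resident")

# If this looks balanced, build the DaskDMatrix and call xgb.dask.train
# exactly as in the snippet above.
```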
1 answer · 0 votes · 6 views · asked 16 days ago

ECS EC2 instance is not registered to target group

I created an ECS service using EC2 instances, then I created an Application Load Balancer and a target group. The task definition for my Docker image uses the following configuration:

```json
{
  "ipcMode": null,
  "executionRoleArn": null,
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/onestapp-task-prod",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "cpu": 0,
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": 512,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "637960118793.dkr.ecr.us-east-2.amazonaws.com/onestapp-repository-prod:5ea9baa2a6165a91c97aee3c037b593f708b33e7",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": false,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "onestapp-container-prod"
    }
  ],
  "placementConstraints": [],
  "memory": "1024",
  "taskRoleArn": null,
  "compatibilities": ["EXTERNAL", "EC2"],
  "taskDefinitionArn": "arn:aws:ecs:us-east-2:637960118793:task-definition/onestapp-task-prod:25",
  "networkMode": null,
  "runtimePlatform": null,
  "cpu": "1024",
  "revision": 25,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
```

The service uses the ALB and the same Target Group as the ALB. My task is running and I can access it using the instance's public IP, but the target group does not have my tasks registered.
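With `"hostPort": 0` (dynamic port mapping on the default bridge network mode), ECS registers the container instance plus the ephemeral host port into the target group, but only for services created with a load balancer configuration pointing at that target group; attaching the target group to the ALB listener alone doesn't register anything. A minimal boto3 sketch of that wiring, with placeholder cluster/service names and target group ARN:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-2")

# Placeholder names/ARN; the target group must use target type "instance"
# when host ports are assigned dynamically on the EC2 launch type.
ecs.create_service(
    cluster="onestapp-cluster-prod",        # hypothetical cluster name
    serviceName="onestapp-service-prod",    # hypothetical service name
    taskDefinition="onestapp-task-prod:25",
    desiredCount=2,
    launchType="EC2",
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-2:637960118793:"
                          "targetgroup/onestapp-tg-prod/0123456789abcdef",  # placeholder
        "containerName": "onestapp-container-prod",
        "containerPort": 80,
    }],
)

# ECS then registers each task as <container-instance>:<dynamic-port> in the
# target group; the instance security group must allow the ephemeral port
# range (32768-65535 by default) from the ALB for health checks to pass.
```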
0 answers · 0 votes · 2 views · asked 23 days ago