Questions in Containers
Sort by most recent
  • 1
  • 90 / page

Browse through the questions and answers listed below or filter and sort to narrow down your results.

How to create dynamic dataframe from AWS Glue catalog in local environment?

I have been testing AWS Glue version 3.0 jobs locally using Docker containers, as detailed [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/). The following code outputs two lists, one per connection, with the names of the tables in a database:

```python
import boto3

db_name_s3 = "s3_connection_db"
db_name_mysql = "glue_catalog_mysql_connection_db"

def retrieve_tables(database_name):
    session = boto3.session.Session()
    glue_client = session.client("glue")
    response_get_tables = glue_client.get_tables(DatabaseName=database_name)
    return response_get_tables

s3_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_s3)["TableList"]]
mysql_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_mysql)["TableList"]]

print(f"These are the tables from {db_name_s3} db: {s3_tables_list}\n")
print(f"These are the tables from {db_name_mysql} db {mysql_tables_list}")
```

Now I try to create a dynamic frame with the *from_catalog* method in this way:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

source_activities = glueContext.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=table_name
)
```

When `database="s3_connection_db"` everything works fine; however, when I set `database="glue_catalog_mysql_connection_db"` I get the following error:

```python
Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
```

I understand the issue is related to the fact that I am trying to fetch data from a MySQL table, but I am not sure how to solve this. By the way, the job runs fine on the Glue console. I would really appreciate some help, thanks!
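In case it helps anyone reading along: the `ClassNotFoundException` points at the MySQL Connector/J JDBC driver not being on the Spark classpath inside the local container (on the Glue console the driver is presumably provided with the connection). A minimal sketch of one way to supply it locally - the jar path/version and the database/table names below are assumptions, not something shipped with the Glue image:

```python
# Sketch: make a locally downloaded MySQL Connector/J jar visible to Spark before
# the GlueContext is created. Run this at the top of the script, before any other
# SparkContext exists. Paths, versions and names below are assumptions.
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

mysql_jar = "/home/glue_user/workspace/jars/mysql-connector-java-8.0.29.jar"  # assumed location

conf = (
    SparkConf()
    .set("spark.jars", mysql_jar)
    .set("spark.driver.extraClassPath", mysql_jar)
    .set("spark.executor.extraClassPath", mysql_jar)
)

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)

source_activities = glueContext.create_dynamic_frame.from_catalog(
    database="glue_catalog_mysql_connection_db",
    table_name="some_mysql_table",  # hypothetical table name
)
```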
0
answers
0
votes
12
views
asked 2 days ago

ApplicationLoadBalancedFargateService with listener on one port and health check on another fails health check

Hi, I have an ApplicationLoadBalancedFargateService that exposes a service on one port, but the health check runs on another. Unfortunately, the target fails the health check and terminates the task. Here's a snippet of my code:

```
const hostPort = 5701;
const healthCheckPort = 8080;

taskDefinition.addContainer(stackPrefix + 'Container', {
  image: ecs.ContainerImage.fromRegistry('hazelcast/hazelcast:3.12.6'),
  environment : {
    'JAVA_OPTS': `-Dhazelcast.local.publicAddress=localhost:${hostPort} -Dhazelcast.rest.enabled=true`,
    'LOGGING_LEVEL':'DEBUG',
    'PROMETHEUS_PORT': `${healthCheckPort}`},
  portMappings: [{containerPort : hostPort, hostPort: hostPort},{containerPort : healthCheckPort, hostPort: healthCheckPort}],
  logging: ecs.LogDriver.awsLogs({streamPrefix: stackPrefix, logRetention: logs.RetentionDays.ONE_DAY}),
});

const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, stackPrefix + 'Service', {
  cluster,
  publicLoadBalancer : false,
  desiredCount: 1,
  listenerPort: hostPort,
  taskDefinition: taskDefinition,
  securityGroups : [fargateServiceSecurityGroup],
  domainName : env.getPrefixedRoute53(stackName),
  domainZone : env.getDomainZone(),
});

loadBalancedFargateService.targetGroup.configureHealthCheck({
  path: "/metrics",
  port: healthCheckPort.toString(),
  timeout: cdk.Duration.seconds(15),
  interval: cdk.Duration.seconds(30),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 5,
  healthyHttpCodes: '200-299'
});
```

Any suggestions on how I can get this to work? thanks
1
answers
0
votes
35
views
asked 5 days ago

Containers/services unable to communicate between them

I have created an ECS cluster that runs one service on Fargate with one task definition. The task definition runs two containers that are supposed to communicate with each other:

- nginx (using `fastcgi_pass <hostname>:9000;`)
- php-fpm

I have tried running them in one task definition or in separate services (with Service Discovery set with either A records or SRV records - I have tried all the options).

Other info:

- Public VPC with two public subnets
- Security group that allows access from itself to port 9000 (the php-fpm port)
- Load balancer connected to the nginx container on port 80

Here is one of the task definitions that I tried, in this case running the containers in the same task definition (nginx has `fastcgi_pass localhost:9000;`). I hope somebody can help me... It can't be this hard to do something so simple. Nothing seems to work.

```
{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::359816492978:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    { "dnsSearchDomains": null, "environmentFiles": null, "logConfiguration": { "logDriver": "awslogs", "secretOptions": null, "options": { "awslogs-group": "/ecs/v1-stage", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "ecs" } }, "entryPoint": null, "portMappings": [ { "hostPort": 80, "protocol": "tcp", "containerPort": 80 } ], "command": null, "linuxParameters": null, "cpu": 0, "environment": [], "resourceRequirements": null, "ulimits": null, "dnsServers": null, "mountPoints": [], "workingDirectory": null, "secrets": null, "dockerSecurityOptions": null, "memory": null, "memoryReservation": null, "volumesFrom": [], "stopTimeout": null, "image": "359816492978.dkr.ecr.us-east-1.amazonaws.com/nginx", "startTimeout": null, "firelensConfiguration": null, "dependsOn": null, "disableNetworking": null, "interactive": null, "healthCheck": null, "essential": true, "links": [], "hostname": null, "extraHosts": null, "pseudoTerminal": null, "user": null, "readonlyRootFilesystem": null, "dockerLabels": null, "systemControls": null, "privileged": null, "name": "nginx" },
    { "dnsSearchDomains": null, "environmentFiles": null, "logConfiguration": { "logDriver": "awslogs", "secretOptions": null, "options": { "awslogs-group": "/ecs/v1-stage", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "ecs" } }, "entryPoint": null, "portMappings": [ { "hostPort": 9000, "protocol": "tcp", "containerPort": 9000 } ], "command": null, "linuxParameters": null, "cpu": 0, "environment": [], "resourceRequirements": null, "ulimits": null, "dnsServers": null, "mountPoints": [], "workingDirectory": null, "secrets": null, "dockerSecurityOptions": null, "memory": null, "memoryReservation": null, "volumesFrom": [], "stopTimeout": null, "image": "359816492978.dkr.ecr.us-east-1.amazonaws.com/php", "startTimeout": null, "firelensConfiguration": null, "dependsOn": null, "disableNetworking": null, "interactive": null, "healthCheck": null, "essential": true, "links": [], "hostname": null, "extraHosts": null, "pseudoTerminal": null, "user": null, "readonlyRootFilesystem": null, "dockerLabels": null, "systemControls": null, "privileged": null, "name": "php" }
  ],
  "placementConstraints": [],
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::359816492978:role/ecsTaskExecutionRole",
  "compatibilities": [ "EC2", "FARGATE" ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:359816492978:task-definition/v1-stage:5",
  "family": "v1-stage",
  "requiresAttributes": [
    { "targetId": null, "targetType": null, "value": null, "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" },
    { "targetId": null, "targetType": null, "value": null, "name": "ecs.capability.execution-role-awslogs" },
    { "targetId": null, "targetType": null, "value": null, "name": "com.amazonaws.ecs.capability.ecr-auth" },
    { "targetId": null, "targetType": null, "value": null, "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" },
    { "targetId": null, "targetType": null, "value": null, "name": "com.amazonaws.ecs.capability.task-iam-role" },
    { "targetId": null, "targetType": null, "value": null, "name": "ecs.capability.execution-role-ecr-pull" },
    { "targetId": null, "targetType": null, "value": null, "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" },
    { "targetId": null, "targetType": null, "value": null, "name": "ecs.capability.task-eni" }
  ],
  "pidMode": null,
  "requiresCompatibilities": [ "FARGATE" ],
  "networkMode": "awsvpc",
  "runtimePlatform": { "operatingSystemFamily": "LINUX", "cpuArchitecture": null },
  "cpu": "512",
  "revision": 5,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
```
1
answers
0
votes
11
views
asked 6 days ago

EKS VPC-CNI Plugin Node Group Setup Questions

I am creating an EKS managed node group in Terraform using the eks module version 17.1.0, and up until now specifying `bootstrap_extra_args` like so has been working:

```
node_groups = [{
  name                          = "${var.environment}-nodes"
  desired_capacity              = var.eks_cluster.desired_capacity
  max_capacity                  = var.eks_cluster.max_capacity
  min_capacity                  = var.eks_cluster.min_capacity
  additional_security_group_ids = aws_security_group.nodes.id
  instance_types                = [var.eks_cluster.node_instance_type]
  key_name                      = "$$$$$$"
  bootstrap_extra_args          = "/etc/eks/bootstrap.sh '${local.cluster_name}' --use-max-pods false --kubelet-extra-args '--max-pods=110'"
}]
```

I have created two clusters like this and the nodes have been created with max pods set to 110; both of these clusters are in us-east-2. I am now trying to create a cluster in the China region cn-northwest-1, and the same configuration only sets max pods to 35 - I cannot seem to get it any higher.

Node types: t3a.large instances

Note: I have also attempted to launch the nodes in China with a launch_template containing the following userdata script. The script is read and there are no errors that I can find, but I end up with the same result.

```
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -xe
/etc/eks/bootstrap.sh '${cluster_name}' --use-max-pods false --kubelet-extra-args '--max-pods=110'
--//--
```

This begs the question: are EKS managed node groups set up a bit differently in China? Is what I'm trying to do even possible without some crazy workaround I cannot seem to find?
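One piece of context that may explain the 35: with the default VPC CNI settings, the schedulable pod capacity a node reports is derived from the instance type's ENI and IP address limits, and for t3a.large that works out to exactly 35. A small sketch of the arithmetic; the per-instance ENI/IP figures are taken from the eni-max-pods list shipped with the EKS AMI and should be double-checked:

```python
# Default max pods with the AWS VPC CNI (no prefix delegation enabled):
#   max_pods = enis * (ipv4_addresses_per_eni - 1) + 2
# The t3a.large figures below (3 ENIs, 12 IPv4 addresses each) are assumptions to
# verify against the eni-max-pods.txt file on the node.

def default_max_pods(enis: int, ips_per_eni: int) -> int:
    # one address per ENI is the ENI's own primary IP; +2 covers host-network pods
    return enis * (ips_per_eni - 1) + 2

print(default_max_pods(enis=3, ips_per_eni=12))  # -> 35
```

If the bootstrap flags never reach the kubelet, the node falls back to this computed value, which matches what you are seeing; going meaningfully above it generally requires the CNI's prefix delegation mode rather than the `--max-pods` flag alone.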
1
answers
0
votes
21
views
asked 6 days ago

AWS EKS - EIA attached on node not reachable by Pod

I'm using a standard **AWS EKS** cluster, all cloud based (K8S 1.22), with multiple node groups, one of which uses a Launch Template that defines an Elastic Inference Accelerator attached to the instances (eia2.medium) to serve some kind of TensorFlow model.

I've been struggling a lot to make our deep learning model work at all while deployed. Namely, I have a Pod in a Deployment with a Service Account and an **EKS IRSA** policy attached, based on the AWS Deep Learning Container for inference model serving with TensorFlow 1.15.0. The image used is `763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu`, and when the model is deployed in the cluster, with a node affinity to the proper EIA-enabled node, it simply doesn't work when called using the /invocations endpoint:

```
Using Amazon Elastic Inference Client Library Version: 1.6.3
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-<id>
Elastic Inference Accelerator Type: eia2.medium
Elastic Inference Accelerator Ordinal: 0

2022-05-11 13:47:17.799145: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1221] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator <redacted>list from server.
WARNING:__main__:unexpected tensorflow serving exit (status: 134). restarting.
```

Just as a reference, when using the CPU-only image available at `763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.15.0-cpu`, the model serves perfectly in any environment (locally too), of course with much longer computation time. Along with this, if I deploy a single EC2 instance with the EIA attached and serve the container using a simple Docker command, the EIA works fine and is accessed correctly by the container.

Each EKS node and the Pod itself (via IRSA) has the following policy attached:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elastic-inference:Connect",
                "iam:List*",
                "iam:Get*",
                "ec2:Describe*",
                "ec2:Get*",
                "ec2:ModifyInstanceAttribute"
            ],
            "Resource": "*"
        }
    ]
}
```

as per the documentation from AWS itself. I have also created a **VPC Endpoint for Elastic Inference** as described by AWS and bound it to the private subnets used by the EKS nodes, along with a properly configured **Security Group** that allows **SSH**, **HTTPS** and **8500/8501 TCP** ports from any worker node in the VPC CIDR.

Using both the **AWS Reachability Analyzer** and the **IAM Policy Simulator** nothing seems wrong; the networking and permissions seem fine, and the *EISetupValidator.py* script provided by AWS says the same.

Any clue on what's actually happening here? Am I missing some kind of permissions or networking setup?
0
answers
0
votes
9
views
asked 6 days ago

Pyspark job fails on EMR on EKS virtual cluster: java.lang.ClassCastException

Hi, we are in the process of migrating our PySpark jobs from EMR classic (EC2-based) to an EMR on EKS virtual cluster. We have come across a strange failure in one job where we read some Avro data from S3 and save it straight back in Parquet format. Example code:

```
df = spark.read.format("avro").load(input_path)

df \
    .withColumnRenamed("my_col", "my_new_col") \
    .repartition(60) \
    .write \
    .mode("append") \
    .partitionBy("my_new_col", "date") \
    .format("parquet") \
    .option("compression", "gzip") \
    .save(output_path)
```

This fails with the following message at the `.save()` call (we can tell from the Python traceback, not included here for brevity):

> Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 17) (10.0.3.174 executor 4): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.dataReader$1 of type scala.Function1 in instance of org.apache.spark.sql.execution.datasources.FileFormat$$anon$1

We are running this with `--packages org.apache.spark:spark-avro_2.12:3.1.1` in sparkSubmitParameters. The exact same code ran fine on a normal EMR cluster. Comparing the environments, both have Spark 3.1.1 and Scala 2.12.10; only the Java version is different: 1.8.0_332 (EMR classic) vs 1.8.0_302 (EMR on EKS).

We should also mention that we were able to run another job successfully on EMR on EKS; that job doesn't have this Avro-to-Parquet step (the input is already in Parquet format). So we suspect it has something to do with the extra org.apache.spark:spark-avro_2.12:3.1.1 package we are importing.

We searched the web for the java.lang.ClassCastException and found a couple of issues [here](https://issues.apache.org/jira/browse/SPARK-29497) and [here](https://issues.apache.org/jira/browse/SPARK-25047), but they are not particularly helpful to us since our code is in Python.

Any hints what might be the cause? Thanks and regards, Nikos
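Not an answer, but a sketch that may help with experimenting: this kind of `SerializedLambda` ClassCastException is often associated with the same Spark class being loaded by two different classloaders, which can happen when `--packages` resolves jars at runtime. One experiment is to ship spark-avro as a plain jar from S3 via `--jars` instead. The boto3 call below is only a sketch of where those parameters go on EMR on EKS - the cluster ID, role ARN, release label and S3 paths are placeholders:

```python
# Sketch (placeholders throughout): resubmit the job on the EMR on EKS virtual
# cluster with spark-avro supplied as a jar via --jars rather than via --packages.
import boto3

emr = boto3.client("emr-containers")

spark_submit_parameters = (
    "--jars s3://my-bucket/jars/spark-avro_2.12-3.1.1.jar "
    "--conf spark.executor.instances=4 "
    "--conf spark.executor.memory=8G"
)

response = emr.start_job_run(
    name="avro-to-parquet-test",
    virtualClusterId="<virtual-cluster-id>",
    executionRoleArn="arn:aws:iam::<account-id>:role/<emr-on-eks-job-execution-role>",
    releaseLabel="emr-6.3.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/avro_to_parquet.py",
            "sparkSubmitParameters": spark_submit_parameters,
        }
    },
)
print(response["id"])
```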
1
answers
0
votes
12
views
asked 9 days ago

Lambda function as image, how to find your handler URI

Hello, I have followed all of the tutorials on how to build an AWS Lambda function as a container image, and I am also using the AWS SAM SDK. What I don't understand is: how do I figure out my endpoint URL mapping from within my image to the Lambda function?

For example, in my Docker image (based on the AWS Python 3.9 image) I install some other packages and my Python requirements, and my handler is defined as:

summarizer_function_lambda.postHandler

The Python file being copied into the image has the same name as above, but without the .postHandler

My AWS SAM template has:

```
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: AWS Lambda dist-bart-summarizer function
# More info about Globals: https://github.com/awslabs/serverless-application-model/blob/master/docs/globals.rst
Globals:
  Function:
    Timeout: 3

Resources:
  DistBartSum:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      FunctionName: DistBartSum
      ImageUri: <my-image-url>
      PackageType: Image
      Events:
        SummarizerFunction:
          Type: Api # More info about API Event Source: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#api
          Properties:
            Path: /postHandler
            Method: POST
```

So what is my actual URI path to do my POST call, either locally or once deployed on Lambda? When I try a curl command I get `{"message": "Internal server error"}`:

```
curl -XPOST "https://<my-aws-uri>/Prod/postHandler/" -d '{"content": "Test data.\r\n"}'
```

So I guess my question is: how do you "map" your handler definitions from within a container all the way to the endpoint URI?
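For what it's worth, the mapping has two halves: the handler string (`module.function`) tells the Lambda runtime inside the image which Python function to invoke, while the `Path`/`Method` on the SAM `Api` event decide the URL. With the implicit API that SAM creates, the deployed endpoint should look like `https://<api-id>.execute-api.<region>.amazonaws.com/Prod/postHandler`, and `sam local start-api` serves the same route at `http://127.0.0.1:3000/postHandler`. A minimal handler sketch consistent with the `summarizer_function_lambda.postHandler` handler string (how the body is parsed is an assumption about your payload):

```python
# summarizer_function_lambda.py - minimal sketch of the function that the handler
# string "summarizer_function_lambda.postHandler" (module.function) points at.
# Behind API Gateway, the POSTed payload arrives as a string in event["body"].
import json

def postHandler(event, context):
    body = json.loads(event.get("body") or "{}")
    content = body.get("content", "")
    # ... run the summarizer on `content` here ...
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received_characters": len(content)}),
    }
```

An "Internal server error" from API Gateway often just means the function raised an exception or returned a response without this status/body shape, so the function's CloudWatch logs are usually the quickest way to see which half of the mapping is failing.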
2
answers
0
votes
37
views
asked 11 days ago

Django App in ECS Container Cannot Connect to S3 in Gov Cloud

I have a container running in an EC2 instance on ECS. The container is hosting a Django based application that utilizes S3 and RDS for its file storage and db needs respectively. I have appropriately configured my VPC, subnets, VPC endpoints, internet gateway, roles, security groups, and other parameters such that I am able to host the site, connect to the RDS instance, and I can even access the site.

The issue is with the connection to S3. When I try to run the command `python manage.py collectstatic --no-input`, which should upload/update any new/modified files to S3 as part of the application set up, the program hangs and will not continue. No files are transferred to the already set up S3 bucket.

**Details of the set up:**

All of the below is hosted on AWS Gov Cloud

**VPC and Subnets**

* 1 VPC located in Gov Cloud East with 2 availability zones (AZ) and one private and public subnet in each AZ (4 total subnets)
* The 3 default routing tables (1 for each private subnet, and 1 for the two public subnets together)
* DNS hostnames and DNS resolution are both enabled

**VPC Endpoints**

All endpoints have the "vpce-sg" security group attached and are associated to the above VPC

* s3 gateway endpoint (set up to use the two private subnet routing tables)
* ecr-api interface endpoint
* ecr-dkr interface endpoint
* ecs-agent interface endpoint
* ecs interface endpoint
* ecs-telemetry interface endpoint
* logs interface endpoint
* rds interface endpoint

**Security Groups**

* Elastic Load Balancer Security Group (elb-sg)
  * Used for the elastic load balancer
  * Only allows inbound traffic from my local IP
  * No outbound restrictions
* ECS Security Group (ecs-sg)
  * Used for the EC2 instance in ECS
  * Allows all traffic from the elb-sg
  * Allows http:80, https:443 from vpce-sg for s3
  * Allows postgresql:5432 from vpce-sg for rds
  * No outbound restrictions
* VPC Endpoints Security Group (vpce-sg)
  * Used for all vpc endpoints
  * Allows http:80, https:443 from ecs-sg for s3
  * Allows postgresql:5432 from ecs-sg for rds
  * No outbound restrictions

**Elastic Load Balancer**

* Set up to use an Amazon Certificate https connection with a domain managed by GoDaddy since Gov Cloud route53 does not allow public hosted zones
* Listener on http permanently redirects to https

**Roles**

* ecsInstanceRole (Used for the EC2 instance on ECS)
  * Attached policies: AmazonS3FullAccess, AmazonEC2ContainerServiceforEC2Role, AmazonRDSFullAccess
  * Trust relationships: ec2.amazonaws.com
* ecsTaskExecutionRole (Used for executionRole in task definition)
  * Attached policies: AmazonECSTaskExecutionRolePolicy
  * Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com
* ecsRunTaskRole (Used for taskRole in task definition)
  * Attached policies: AmazonS3FullAccess, CloudWatchLogsFullAccess, AmazonRDSFullAccess
  * Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com

**S3 Bucket**

* Standard bucket set up in the same Gov Cloud region as everything else

**Trouble Shooting**

If I bypass the connection to S3, the application successfully launches and I can connect to the website, but since static files are supposed to be hosted on S3 there is less formatting and images are missing.

Using a bastion instance I was able to ssh into the EC2 instance running the container and successfully test my connection to S3 from there using `aws s3 ls s3://BUCKET_NAME`.

If I connect to a shell within the application container itself and try to connect to the bucket using...

```
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
s3.meta.client.head_bucket(Bucket=bucket.name)
```

I receive a timeout error...

```
File "/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f3da4467190>, 'Connection to BUCKET_NAME.s3.amazonaws.com timed out. (connect timeout=60)')
...
File "/.venv/lib/python3.9/site-packages/botocore/httpsession.py", line 418, in send
    raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://BUCKET_NAME.s3.amazonaws.com/"
```

Based on [this article](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#vpc-endpoints-policies-s3) I think this may have something to do with the fact that I am using the GoDaddy DNS servers, which may be preventing proper URL resolution for S3.

> If you're using the Amazon DNS servers, you must enable both DNS hostnames and DNS resolution for your VPC. If you're using your own DNS server, ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS.

I am unsure of how to ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS. Perhaps I need to set up another private DNS on route53? I have tried a very similar set up for this application in AWS non-Gov Cloud using route53 public DNS instead of GoDaddy and there is no issue connecting to S3. Please let me know if there is any other information I can provide to help.
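One detail in the traceback that may be worth chasing: the client is trying to reach `BUCKET_NAME.s3.amazonaws.com`, which is the commercial-partition endpoint rather than a GovCloud one, so the S3 gateway endpoint in the VPC never comes into play. A quick test from the container shell is to pin the region/endpoint explicitly; the region and endpoint below assume GovCloud East and should be adjusted to match your setup:

```python
# Sketch: point boto3 at the GovCloud regional S3 endpoint instead of the default
# s3.amazonaws.com hostname seen in the timeout. Region/endpoint are assumptions.
import boto3

s3 = boto3.resource(
    "s3",
    region_name="us-gov-east-1",
    endpoint_url="https://s3.us-gov-east-1.amazonaws.com",
)

bucket = s3.Bucket("BUCKET_NAME")
s3.meta.client.head_bucket(Bucket=bucket.name)
print("reachable")
```

If this succeeds where the default call hangs, the fix is likely just making sure the container's AWS region configuration (e.g. `AWS_DEFAULT_REGION` or the Django storage settings) points at the GovCloud region so boto3 builds the regional hostname on its own.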
3
answers
0
votes
39
views
asked 11 days ago

Unable to override taskRoleArn when running ECS task from Lambda

I have a Lambda function that is supposed to pass its own permissions to the code running in an ECS task. It looks like this:

```
ecs_parameters = {
    "cluster": ...,
    "launchType": "FARGATE",
    "networkConfiguration": ...,
    "overrides": {
        "taskRoleArn": boto3.client("sts").get_caller_identity().get("Arn"),
        ...
    },
    "platformVersion": "LATEST",
    "taskDefinition": f"my-task-definition-{STAGE}",
}

response = ecs.run_task(**ecs_parameters)
```

When I run this in Lambda, I get this error:

```
"errorMessage": "An error occurred (ClientException) when calling the RunTask operation: ECS was unable to assume the role 'arn:aws:sts::787364832896:assumed-role/my-lambda-role...' that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role."
```

If I change the task definition in ECS to use `my-lambda-role` as the task role, it works. It's specifically when I try to override the task role from Lambda that it breaks.

The Lambda role has the `AWSLambdaBasicExecutionRole` policy and also an inline policy that grants it `ecs:runTask` and `iam:PassRole`. It has a trust relationship that looks like:

```
"Effect": "Allow",
"Principal": {
    "Service": [
        "ecs.amazonaws.com",
        "lambda.amazonaws.com",
        "ecs-tasks.amazonaws.com"
    ]
},
"Action": "sts:AssumeRole"
```

The task definition has a policy that grants it `sts:AssumeRole` and `iam:PassRole`, and a trust relationship that looks like:

```
"Effect": "Allow",
"Principal": {
    "Service": "ecs-tasks.amazonaws.com",
    "AWS": "arn:aws:iam::account-ID:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS"
},
"Action": "sts:AssumeRole"
```

How do I allow the Lambda function to pass the role to ECS, and ECS to assume the role it's been given?

P.S. - I know a lot of these permissions are overkill, so let me know if there are any I can get rid of :) Thanks!
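One thing that stands out: `get_caller_identity()` returns an STS *assumed-role* ARN (`arn:aws:sts::...:assumed-role/...`), while `taskRoleArn` expects a plain IAM role ARN (`arn:aws:iam::...:role/...`) that trusts `ecs-tasks.amazonaws.com`. A sketch of the call passing the underlying role ARN instead - the role name, cluster and network settings below are placeholders:

```python
# Sketch (placeholders throughout): override the task role with an IAM role ARN,
# not the STS assumed-role ARN the Lambda is currently running as. The role must
# trust ecs-tasks.amazonaws.com, and the Lambda role needs iam:PassRole on it.
import boto3

ecs = boto3.client("ecs")

TASK_ROLE_ARN = "arn:aws:iam::787364832896:role/my-lambda-role"  # hypothetical IAM role ARN

response = ecs.run_task(
    cluster="my-cluster",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={"taskRoleArn": TASK_ROLE_ARN},
    platformVersion="LATEST",
    taskDefinition="my-task-definition-dev",
)
```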
2
answers
1
votes
21
views
asked 18 days ago

Architecture for multi-region ECS application

Hi everyone, I just wanted to get feedback on my proposed solution for a multi-region ECS dockerized app. Currently we have the following resources in Region A:

```
Postgres DB (Used for user accounts only)
Backend+Frontend NextJS App (Dockerized) ECS
Backend Microservice App for conversion of files (Dockerized) ECS
Backend 3rd party API + Datastore (This resource is also deployed in other regions) Unknown architecture
```

I now need to deploy to Regions B and C. The Backend 3rd party API is already deployed in these regions. I am thinking of deploying the following resources to the following regions:

```
Backend+Frontend NextJS App (Dockerized)
Backend Microservice App for conversion of files (Dockerized)
```

Our app logs in the user (authentication + authorization) using the 3rd party API, and after login we can see which region their data is in. So after login I can bounce them + their token to the appropriate region. I cannot use Route53 routing reliably because the Source of Truth about their region is available after login, and, for example, they may be (rarely) accessing from region B (if they are travelling) while their datastore is in region C (in which case I need to bounce them to region C).

I also don't need to replicate our database to other regions because it only stores their account information for billing purposes, so the performance impact is minimal and only checked on login/logout. Currently we have low 10s of users, so I can easily restructure and deploy a different architecture if/when we start scaling. Critique is welcome!
1
answers
0
votes
11
views
asked 19 days ago

XGBoost Error: Allreduce failed - 100GB Dask Dataframe on AWS Fargate ECS cluster dies with 1T of memory.

Overview: I'm trying to run an XGBoost model on a bunch of parquet files sitting in S3 using Dask, by setting up a Fargate cluster and connecting it to a Dask cluster. The dataframe totals about 140 GB of data. I scaled up a Fargate cluster with these properties:

- Workers: 40
- Total threads: 160
- Total memory: 1 TB

So there should be enough memory to hold the data. Each worker has 9+ GB with 4 threads.

I do some very basic preprocessing and then I create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail. Next I run xgb.dask.train, which utilizes the xgboost package, not the dask_ml.xgboost package. Very quickly, the workers die and I get the error `XGBoostError: rabit/internal/utils.h:90: Allreduce failed`. When I attempted this with a single file with only 17 MB of data, I would still get this error, but only a couple of workers die. Does anyone know why this happens, since I have double the memory of the dataframe?

```
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```
1
answers
0
votes
6
views
asked 20 days ago

How can I build a CloudFormation secret out of another secret?

I have an image I deploy to ECS that expects an environment variable called `DATABASE_URL` which contains the username and password as the userinfo part of the URL (e.g. `postgres://myusername:mypassword@mydb.foo.us-east-1.rds.amazonaws.com:5432/mydbname`). I cannot change the image.

Using `DatabaseInstance.Builder.credentials(fromGeneratedSecret("myusername"))`, CDK creates a secret in Secrets Manager for me that has all of this information, but not as a single value:

```json
{
  "username": "myusername",
  "password": "mypassword",
  "engine": "postgres",
  "host": "mydb.foo.us-east-1.rds.amazonaws.com",
  "port": 5432,
  "dbInstanceIdentifier": "live-myproduct-db"
}
```

Somehow I need to synthesise that `DATABASE_URL` environment variable. I don't think I can do it in the ECS task definition - as far as I can tell the secret can only reference a single key in a secret.

I thought I might be able to add an extra `url` key to the existing secret using references in CloudFormation - but I can't see how. Something like:

```java
secret.newBuilder()
    .addTemplatedKey(
        "url",
        "postgres://#{username}:#{password}@#{host}:#{port}/#{db}"
    )
    .build()
```

except that I just made that up...

Alternatively I could use CDK to generate a new secret in either Secrets Manager or Systems Manager - but again I want to specify it as a template so that the real secret values don't get materialised in the CloudFormation template.

Any thoughts? I'm hoping I'm just missing some way to use the API to build compound secrets...
3
answers
0
votes
16
views
asked 20 days ago

boto3 ecs.describe_task call returns task missing

I'm trying to use a boto3 ECS waiter to wait on a Fargate ECS task to complete. The vast majority of the time the waiter works as expected (waits for the task to reach the STOPPED status). However, sporadically the waiter will return a failure because a task is marked as missing, even though I can find the task itself in the cluster along with its CloudWatch logs.

When I first encountered this, I switched to using the boto3 [`ecs.describe_tasks`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.describe_tasks) method to see if I could get more information about what was happening. When the above situation occurs, describe_tasks returns something like:

```
{'tasks': [],
 'failures': [{'arn': 'arn:aws:ecs:us-west-2:21234567891011:task/something-something/dsfsadfasdhfasjklhdfkdsajhf',
               'reason': 'MISSING'}],
 'ResponseMetadata': {'RequestId': 'sdkfjaskdjfhaskdjfhasd',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'x-amzn-requestid': 'sdkjfhsdkajhfksadhkjsadf',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'content-length': '145',
                                      'date': 'Fri, 06 May 2022 08:36:11 GMT'},
                      'RetryAttempts': 0}}
```

I've looked at the [AWS docs](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html) and none of the scenarios outlined for `reason: MISSING` apply in my circumstance. I'm passing my cluster name as an argument to the call as well. Since this happens intermittently it's difficult to troubleshoot.

What does the MISSING status mean? What are the reasons why an API call to check on task status would return MISSING when the task exists?
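Not a full explanation, but a pattern that tends to paper over the intermittency: treat a `MISSING` failure from `describe_tasks` as retryable for a short grace period rather than terminal, since the failure reason only says the API couldn't see that task in that cluster at that moment (and stopped tasks also only stay describable for a limited window). A sketch - the cluster name, ARN and timings are placeholders:

```python
# Sketch (placeholders throughout): poll describe_tasks directly and retry through
# transient MISSING responses instead of failing on the first one.
import time
import boto3

ecs = boto3.client("ecs")

def wait_for_stopped(cluster: str, task_arn: str, attempts: int = 30, delay: int = 10) -> dict:
    for _ in range(attempts):
        resp = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        if resp["tasks"]:
            task = resp["tasks"][0]
            if task["lastStatus"] == "STOPPED":
                return task
        elif resp["failures"]:
            # intermittent MISSING ends up here; log it and try again
            print("describe_tasks failures:", resp["failures"])
        time.sleep(delay)
    raise TimeoutError(f"{task_arn} did not reach STOPPED after {attempts * delay}s")

# wait_for_stopped("my-cluster", "arn:aws:ecs:us-west-2:123456789012:task/my-cluster/abc123")
```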
1
answers
0
votes
9
views
asked 23 days ago

My ECS tasks (VPC A) can't connect to my RDS (VPC B) even though the VPCs are peered and networking is configured correctly

Hi, as mentioned in the question, my ECS tasks cannot connect to my RDS. The ECS tasks try to resolve the RDS by name, and it resolves to the RDS public IP (the RDS has both public and private IPs). However, the security group on the RDS doesn't allow open access from all IPs, so the connection fails. I temporarily allowed all connections and could see that the ECS tasks are routing through the open internet to access the RDS.

Reachability Analyzer, checking from a specific task's Elastic Network Interface to the RDS ENI, is successful, using internal routing through the peering connection. At the same time I have another server on VPC C that can connect to the RDS. All the config is similar between these two apps, including the peering connection, security group policies and routing tables.

Any help is appreciated. Here are some details about the VPCs:

- VPC A - 15.2.0.0/16 [three subnets]
- VPC B - 111.30.0.0/16 [three subnets]
- VPC C - 15.0.0.0/16 [three subnets]
- Peering Connection 1 between A and B
- Peering Connection 2 between C and B

Route table for VPC A:

- 111.30.0.0/16: Peering Connection 1
- 15.2.0.0/16: Local
- 0.0.0.0/0: Internet Gateway

Route table for VPC C:

- 111.30.0.0/16: Peering Connection 2
- 15.2.0.0/16: Local
- 0.0.0.0/0: Internet Gateway

Security groups allow traffic to RDS:

- Ingress:
  - 15.0.0.0/16: Allow DB Port
  - 15.2.0.0/16: Allow DB Port
- Egress:
  - 0.0.0.0/0: Allow all ports

When I add the rule 0.0.0.0/0 Allow DB Port to the RDS, then ECS can connect to my RDS through its public IP.
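One setting worth checking, since the symptom is the RDS hostname resolving to its public IP from VPC A: cross-VPC DNS resolution is off by default on a peering connection, and without it a publicly accessible RDS endpoint resolves to the public address from the peered VPC. It can be enabled on both sides of Peering Connection 1 - a sketch with a placeholder peering ID:

```python
# Sketch (peering connection ID is a placeholder): enable DNS resolution support on
# both sides of the A<->B peering connection so the RDS endpoint resolves to its
# private IP from VPC A instead of the public one.
import boto3

ec2 = boto3.client("ec2")

ec2.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId="pcx-0123456789abcdef0",
    RequesterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
    AccepterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
)
```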
1
answers
2
votes
7
views
asked 24 days ago

ApplicationLoadBalancedFargateService with load balancer, target groups, targets on non-standard port

I have an ECS service that exposes port 8080. I want to have the load balancer, target group and target use that port as opposed to port 80. Here is a snippet of my code:

```
const servicePort = 8888;
const metricsPort = 8888;

const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDef');
const repository = ecr.Repository.fromRepositoryName(this, 'cloud-config-server', 'cloud-config-server');
taskDefinition.addContainer('Config', {
  image: ecs.ContainerImage.fromEcrRepository(repository),
  portMappings: [{containerPort : servicePort, hostPort: servicePort}],
});

const albFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'AlbConfigService', {
  cluster,
  publicLoadBalancer : false,
  taskDefinition: taskDefinition,
  desiredCount: 1,
});

const applicationTargetGroup = new elbv2.ApplicationTargetGroup(this, 'AlbConfigServiceTargetGroup', {
  targetType: elbv2.TargetType.IP,
  protocol: elbv2.ApplicationProtocol.HTTP,
  port: servicePort,
  vpc,
  healthCheck: {path: "/CloudConfigServer/actuator/env/profile", port: String(servicePort)}
});

const addApplicationTargetGroupsProps: elbv2.AddApplicationTargetGroupsProps = {
  targetGroups: [applicationTargetGroup],
};

albFargateService.loadBalancer.addListener('alb-listener', {
  protocol: elbv2.ApplicationProtocol.HTTP,
  port: servicePort,
  defaultTargetGroups: [applicationTargetGroup]
});
```

This does not work. The health check is taking place on port 80 with the default URL of "/", which fails, and the tasks are constantly recycled. A target group on port 8080, with the appropriate health check, is added, but it has no targets.

What is the recommended way to achieve load balancing on a port other than 80? thanks
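For comparison, the L3 pattern can be pointed at a non-default port directly: pass the listener port to the construct and adjust the health check on the target group it creates, instead of wiring up a second target group and listener by hand (the hand-made target group ends up with no targets because nothing registers the Fargate service into it). The sketch below uses CDK's Python binding purely for illustration; construct IDs, the image and the health-check path are assumptions:

```python
# Rough sketch (CDK v2, Python binding): let the ApplicationLoadBalancedFargateService
# pattern create the listener on the service port, then tweak its own target group's
# health check. IDs, image and paths are placeholders.
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct

SERVICE_PORT = 8888

class ConfigServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "Vpc", max_azs=2)
        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

        task_definition = ecs.FargateTaskDefinition(self, "TaskDef")
        task_definition.add_container(
            "Config",
            image=ecs.ContainerImage.from_registry("cloud-config-server:latest"),  # placeholder image
            port_mappings=[ecs.PortMapping(container_port=SERVICE_PORT)],
        )

        alb_service = ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "AlbConfigService",
            cluster=cluster,
            public_load_balancer=False,
            desired_count=1,
            listener_port=SERVICE_PORT,   # listener on the service port instead of 80
            task_definition=task_definition,
        )

        alb_service.target_group.configure_health_check(
            path="/CloudConfigServer/actuator/env/profile",
            port=str(SERVICE_PORT),
            healthy_http_codes="200-299",
            interval=Duration.seconds(30),
        )

app = App()
ConfigServiceStack(app, "ConfigServiceStack")
app.synth()
```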
1
answers
0
votes
18
views
asked 25 days ago

ECS EC2 instance is not registered to target group

I created an ECS service using EC2 instances, then I created an Application Load Balancer and a target group. The task definition for my Docker image uses the following configuration:

```json
{
  "ipcMode": null,
  "executionRoleArn": null,
  "containerDefinitions": [
    { "dnsSearchDomains": null, "environmentFiles": null, "logConfiguration": { "logDriver": "awslogs", "secretOptions": null, "options": { "awslogs-group": "/ecs/onestapp-task-prod", "awslogs-region": "us-east-2", "awslogs-stream-prefix": "ecs" } }, "entryPoint": null, "portMappings": [ { "hostPort": 0, "protocol": "tcp", "containerPort": 80 } ], "cpu": 0, "resourceRequirements": null, "ulimits": null, "dnsServers": null, "mountPoints": [], "workingDirectory": null, "secrets": null, "dockerSecurityOptions": null, "memory": null, "memoryReservation": 512, "volumesFrom": [], "stopTimeout": null, "image": "637960118793.dkr.ecr.us-east-2.amazonaws.com/onestapp-repository-prod:5ea9baa2a6165a91c97aee3c037b593f708b33e7", "startTimeout": null, "firelensConfiguration": null, "dependsOn": null, "disableNetworking": null, "interactive": null, "healthCheck": null, "essential": true, "links": null, "hostname": null, "extraHosts": null, "pseudoTerminal": null, "user": null, "readonlyRootFilesystem": false, "dockerLabels": null, "systemControls": null, "privileged": null, "name": "onestapp-container-prod" }
  ],
  "placementConstraints": [],
  "memory": "1024",
  "taskRoleArn": null,
  "compatibilities": [ "EXTERNAL", "EC2" ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-2:637960118793:task-definition/onestapp-task-prod:25",
  "networkMode": null,
  "runtimePlatform": null,
  "cpu": "1024",
  "revision": 25,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
```

The service is using the ALB and the same target group as the ALB. My task is running, and I can access it using the instance's public IP, but the target group does not have my tasks registered.
0
answers
0
votes
2
views
asked a month ago

Fail to start an EC2 task on ECS

Hi there, I am trying to start a task which uses a GPU on my instance. The EC2 instance is already added to a cluster, but the task fails to start. Here is the error:

```
status: STOPPED (CannotStartContainerError: Error response from dae)

Details
Status reason: CannotStartContainerError: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr
Network bindings - not configured
```

EC2 setup:

```
Type: AWS::EC2::Instance
Properties:
  IamInstanceProfile: !Ref InstanceProfile
  ImageId: ami-0d5564ca7e0b414a9
  InstanceType: g4dn.xlarge
  KeyName: tmp-key
  SubnetId: !Ref PrivateSubnetOne
  SecurityGroupIds:
    - !Ref ContainerSecurityGroup
  UserData:
    Fn::Base64: !Sub |
      #!/bin/bash
      echo ECS_CLUSTER=traffic-data-cluster >> /etc/ecs/ecs.config
      echo ECS_ENABLED_GPU_SUPPORT=true >> /etc/ecs/ecs.config
```

Dockerfile:

```
FROM nvidia/cuda:11.6.0-base-ubuntu20.04

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

# RUN nvidia-smi
RUN echo 'install pip packages'
RUN apt-get update
RUN apt-get install python3.8 -y
RUN apt-get install python3-pip -y
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 --version
RUN python --version

WORKDIR /

COPY deployment/video-blurring/requirements.txt /requirements.txt
RUN pip3 install --upgrade pip
RUN pip3 install --user -r /requirements.txt

## Set up the requisite environment variables that will be passed during the build stage
ARG SERVER_ID
ARG SERVERLESS_STAGE
ARG SERVERLESS_REGION

ENV SERVER_ID=$SERVER_ID
ENV SERVERLESS_STAGE=$SERVERLESS_STAGE
ENV SERVERLESS_REGION=$SERVERLESS_REGION

COPY config/env-vars .

## Sets up the entry point for running the bashrc which contains environment variable and
## trigger the python task handler
COPY script/*.sh /
RUN ["chmod", "+x", "./initialise_task.sh"]

## Copy the code to /var/runtime - following the AWS lambda convention
## Use ADD to preserve the underlying directory structure
ADD src /var/runtime/

ENTRYPOINT ./initialise_task.sh
```
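In case it helps narrow things down: the `error running hook` text is typically the NVIDIA container runtime hook failing on the host, and on ECS a GPU task also has to reserve the GPU explicitly through `resourceRequirements` in the container definition (plus the instance should be running the ECS GPU-optimized AMI so the drivers and nvidia runtime are present - worth double-checking `ami-0d5564ca7e0b414a9`). A sketch of registering such a task definition with boto3; the family, image and sizes are placeholders:

```python
# Sketch (placeholders throughout): register a task definition that explicitly
# reserves one GPU for the container via resourceRequirements.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="traffic-data-gpu-task",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "gpu-worker",
            "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/traffic-data:latest",
            "memory": 4096,
            "cpu": 1024,
            "essential": True,
            "resourceRequirements": [
                {"type": "GPU", "value": "1"}  # ask ECS to bind one GPU to this container
            ],
        }
    ],
)
```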
0
answers
0
votes
3
views
asked a month ago

Container cannot bind to port 80 running as non-root user on ECS Fargate

I have an image that binds to port 80 as a **non-root** user. I can run it locally (macOS Monterey, Docker Desktop 4.7.1) absolutely fine. When I try to run it as part of an ECS service on Fargate it fails like so:

**Failed to bind to 0.0.0.0/0.0.0.0:80**

**caused by SocketException: Permission denied**

Fargate means I have to run the task in network mode `awsvpc` - not sure if that's related?

Any views on what I'm doing wrong? The [best practices document](https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/bestpracticesguide.pdf) suggests that I should be running as non-root (p. 83) and that under awsvpc it's reasonable to expose port 80 (diagram on p. 23).

FWIW here's a mildly cut down version of the JSON from my task definition:

```
{
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:<ID>:task-definition/mything:2",
  "containerDefinitions": [
    {
      "name": "mything",
      "image": "mything:latest",
      "cpu": 0,
      "memory": 1024,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": []
    }
  ],
  "family": "mything",
  "executionRoleArn": "arn:aws:iam::<ID>:role/ecsTaskExecutionRole",
  "networkMode": "awsvpc",
  "revision": 2,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    { "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" },
    { "name": "ecs.capability.execution-role-awslogs" },
    { "name": "com.amazonaws.ecs.capability.ecr-auth" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" },
    { "name": "ecs.capability.execution-role-ecr-pull" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" },
    { "name": "ecs.capability.task-eni" }
  ],
  "placementConstraints": [],
  "compatibilities": [ "EC2", "FARGATE" ],
  "runtimePlatform": { "operatingSystemFamily": "LINUX" },
  "requiresCompatibilities": [ "FARGATE" ],
  "cpu": "256",
  "memory": "1024",
  "tags": []
}
```
2
answers
0
votes
97
views
asked a month ago

Ingress annotations only for a specific path

Hi, I have this ingress configuration:

```
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "oidc-ingress"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=300
    external-dns.alpha.kubernetes.io/hostname: example.com
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    alb.ingress.kubernetes.io/auth-type: oidc
    alb.ingress.kubernetes.io/auth-on-unauthenticated-request: authenticate
    alb.ingress.kubernetes.io/auth-idp-oidc: '{"issuer":"https://login.microsoftonline.com/some-id/v2.0","authorizationEndpoint":"https://login.microsoftonline.com/some-id/oauth2/v2.0/authorize","tokenEndpoint":"https://login.microsoftonline.com/some-id/oauth2/v2.0/token","userInfoEndpoint":"https://graph.microsoft.com/oidc/userinfo","secretName":"aws-alb-secret"}'
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
spec:
  rules:
    - http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: ssl-redirect
                port:
                  name: use-annotation
          - pathType: Prefix
            path: /jenkins
            backend:
              service:
                name: jenkins
                port:
                  number: 8080
          - pathType: Prefix
            path: /
            backend:
              service:
                name: apache
                port:
                  number: 80
```

If I `kubectl apply` this `Ingress` config, it will apply the `annotations` to all routing rules, which means:

```
/*
/jenkins
/jenkins/*
```

I would like to apply the `OIDC annotations` only for the `Jenkins rules`, meaning:

1. If I open `https://example.com` it will be available to everyone.
2. If I open `https://example.com/jenkins`, it will redirect me to the `OIDC auth` page.

I can do this manually through the `AWS console` when I remove the `authenticate rule` from `/*` and leave it for `/jenkins/*` only. However, I would like to achieve this through `Ingress annotations` to be able to automate this process. Please how can I do this? Thanks for your help.
2
answers
0
votes
40
views
asked a month ago

Scheduled Action triggering at time specified in another action

I have a CloudFormation setup with Scheduled Actions to autoscale services based on times. There is one action that scales up to start the service, and another to scale down to turn it off. I also occasionally add an additional action to scale up if a service is needed at a different time on a particular day.

I'm having an issue where my service is being scaled down instead of up when I specify this additional action. Looking at the console logs I get an event that looks like:

```
16:00:00 -0400
Message: Successfully set min capacity to 0 and max capacity to 0
Cause: scheduled action name ScheduleScaling_action_1 was triggered
```

However the relevant part of the CloudFormation template for the Scheduled Action with the name in the log has a different time, e.g.:

```
{
  "ScalableTargetAction": {
    "MaxCapacity": 0,
    "MinCapacity": 0
  },
  "Schedule": "cron(0 5 ? * 2-5 *)",
  "ScheduledActionName": "ScheduleScaling_action_1"
}
```

What is odd is that the time this action is triggering matches exactly with the Schedule time for another action, e.g.:

```
{
  "ScalableTargetAction": {
    "MaxCapacity": 1,
    "MinCapacity": 1
  },
  "Schedule": "cron(00 20 ? * 2-5 *)",
  "ScheduledActionName": "ScheduleScaling_action_2"
}
```

I am using CDK to generate the CloudFormation template, which doesn't appear to allow me to specify a timezone. So my understanding is that the times here should be UTC. What could cause the scheduled action to trigger at the incorrect time like this?
1
answers
0
votes
6
views
asked a month ago

Fargate timeout problem

Hi. I've got an unexpected error, which I guess is about a timeout. The task takes quite a long time - generally less than 1 minute. There is no problem when it finishes before 1 minute, but sometimes the task takes more than 1 minute, and that is when the error occurs. Please see below.

```
[ec2-user@ip-000-00-0-00 ~]$ curl --location --request POST 'userid.ap-northeast-2.elb.amazonaws.com:port/service' -d "video.mp4" -o output.json -v
Note: Unnecessary use of -X or --request, POST is already inferred.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Trying 172.31.74.35:5000...
* Connected to loadbalancer.ap-northeast-2.elb.amazonaws.com (000.00.00.00) port 0000 (#0)
> POST /service HTTP/1.1
> Host: userid.ap-northeast-2.elb.amazonaws.com:port
> User-Agent: curl/7.79.1
> Accept: */*
> Content-Length: 37
> Content-Type: application/x-www-form-urlencoded
>
} [37 bytes data]
100    37    0     0    0    37      0      0 --:--:--  0:00:59 --:--:--     0
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 19 Apr 2022 01:45:23 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< Server: Python/3.7 aiohttp/3.7.4.post0
<
100    37    0     0    0    37      0      0 --:--:--  0:01:00 --:--:--     0
* Connection #0 to host loadbalancer.ap-northeast-2.elb.amazonaws.com left intact
```

With the curl verbose option, I get 500 Internal Server Error at 0:00:59. How can I get my task to finish when it takes more than 1 minute?

I've tried:

- increasing the health check grace period for the **ECS Service**
- increasing the idle timeout of the **Load Balancer**
- increasing the timeout and interval of the **Target group**
- curl options (like keep-alive, max-time)

My EC2 instance:

- type: t2.micro
- Amazon Linux

My service:

- Service type: REPLICA
- Launch type: FARGATE

My task in the service:

- Network mode: awsvpc
- Compatibilities: EC2, FARGATE
- Requires compatibilities: FARGATE
- EFS mounted
- Docker

Appreciate it,
1
answers
0
votes
40
views
asked a month ago