
Questions tagged with Amazon EC2


XGBoost error: Allreduce failed - 100 GB Dask DataFrame on AWS Fargate ECS cluster dies with 1 TB of memory

Overview: I'm trying to train an XGBoost model on a set of Parquet files sitting in S3, using Dask, by setting up a Fargate cluster and connecting it to a Dask cluster. The total DataFrame size comes to about 140 GB of data. I scaled up a Fargate cluster with these properties:

- Workers: 40
- Total threads: 160
- Total memory: 1 TB

So there should be enough memory to hold the data. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail. Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly, the workers die and I get the error `XGBoostError: rabit/internal/utils.h:90: Allreduce failed`. When I attempted this with a single file of only 17 MB of data, I would still get this error, but only a couple of workers died. Does anyone know why this happens, since I have double the memory of the DataFrame?

```
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```
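An editorial sketch of one commonly suggested mitigation, not from the original post: persist and fully materialize the arrays on the workers before building the `DaskDMatrix`, so that Rabit's allreduce does not start while chunks are still being computed or spilled. The scheduler address, S3 path, and `target` column below are placeholder assumptions.

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
client.wait_for_workers(40)              # don't train until all 40 workers exist

df = dd.read_parquet("s3://my-bucket/my-prefix/")            # placeholder path
X = df.drop(columns=["target"]).to_dask_array(lengths=True)  # placeholder label column
y = df["target"].to_dask_array(lengths=True)

# Materialize every chunk in worker memory up front, so training does not
# overlap with loading or shuffling.
X, y = client.persist([X, y])
wait([X, y])

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```

That the same error appears with a single 17 MB file suggests the failure may not be memory pressure at all; the worker logs around the Rabit error (and any restarts by the nanny) would help narrow it down.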
1 answer | 0 votes | 5 views
AWS-User-7732475, asked 7 days ago

ECS EC2 instance is not registered to target group

I created an ECS service using EC2 instances, then created an Application Load Balancer and a target group. The task definition for my Docker image uses the following configuration:

```json
{
  "ipcMode": null,
  "executionRoleArn": null,
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/onestapp-task-prod",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "cpu": 0,
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": 512,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "637960118793.dkr.ecr.us-east-2.amazonaws.com/onestapp-repository-prod:5ea9baa2a6165a91c97aee3c037b593f708b33e7",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": false,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "onestapp-container-prod"
    }
  ],
  "placementConstraints": [],
  "memory": "1024",
  "taskRoleArn": null,
  "compatibilities": ["EXTERNAL", "EC2"],
  "taskDefinitionArn": "arn:aws:ecs:us-east-2:637960118793:task-definition/onestapp-task-prod:25",
  "networkMode": null,
  "runtimePlatform": null,
  "cpu": "1024",
  "revision": 25,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
```

The service uses the ALB and the same target group as the ALB. My task is running, and I can access it using the instance's public IP, but my tasks are not registered in the target group.
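An editorial debugging sketch, not from the original question: with boto3 you can check whether the service was actually created with a load balancer mapping, and what the target group reports for target health. The cluster and service names are placeholder assumptions.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-2")
elbv2 = boto3.client("elbv2", region_name="us-east-2")

# Placeholder names -- substitute your own cluster and service.
service = ecs.describe_services(
    cluster="onestapp-cluster-prod",
    services=["onestapp-service-prod"],
)["services"][0]

# ECS only registers tasks into target groups listed here. If this list is
# empty, the service was created without a load balancer mapping, and no
# amount of manual wiring on the ALB side will register the tasks.
print(service.get("loadBalancers"))

for lb in service.get("loadBalancers", []):
    health = elbv2.describe_target_health(TargetGroupArn=lb["targetGroupArn"])
    for desc in health["TargetHealthDescriptions"]:
        print(desc["Target"], desc["TargetHealth"]["State"])
```

With `"hostPort": 0` (dynamic port mapping), tasks get ephemeral host ports, so the target group must have target type `instance` and registration has to be done by ECS itself through that service-level mapping, not by hand.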
0 answers | 0 votes | 2 views
AWS-User-9232552, asked 14 days ago

EC2Launch InitializeInstance.ps1 Add-Routes fails when VMware is installed on the instance

# Background

I created a Windows Server 2019 AMI with VMware Workstation 15 installed. Before I captured the instance into an AMI, I ran `InitializeInstance.ps1 -Schedule`. When I shared the AMI with another account and it was launched in a different VPC, we tried connecting via SSM, but the AWS console was telling us that the instance wasn't configured properly. When we checked the logs, we got the following output:

```
Windows sysprep configuration complete.
2022/04/20 16:40:16Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:42:31Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:44:46Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:47:02Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:49:17Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:51:33Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:53:49Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:56:05Z: Message: Failed to add routes.. attempting it again
2022/04/20 16:58:20Z: Message: Failed to add routes.. attempting it again
2022/04/20 17:00:36Z: Message: Failed to add routes.. attempting it again
2022/04/20 17:02:53Z: Message: Failed to add routes.. attempting it again
2022/04/20 17:05:09Z: Message: Failed to add routes.. attempting it again
2022/04/20 17:06:14Z: EC2LaunchTelemetry: IsTelemetryEnabled=true
2022/04/20 17:06:14Z: EC2LaunchTelemetry: IsAgentScheduledPerBoot=true
2022/04/20 17:06:14Z: EC2LaunchTelemetry: AgentCommandErrorCode=1
```

When I opened the `EC2Launch.log` file, the exception details showed the following:

```
Failed to add routes.. attempting it again
Cannot bind argument to parameter 'Addresses' because it is null
```

This error is on line 103 of `C:\ProgramData\Amazon\EC2-Windows\Launch\Module\Scripts\Add-Routes.ps1` (the only mention of an `Addresses` parameter in the `Add-Routes` function):

```powershell
[string[]]$defaultGateways = @(FilterIPAddresses -Addresses $networkAdapterConfig.DefaultIPGateway -AddressFamily $AddressFamily)
```

I did a bit of source-code diving to discover the source of the issue, which is described in the following section.

# Bug Description

`InitializeInstance.ps1` calls - among other things - a function called `Add-Routes`, defined in `C:\ProgramData\Amazon\EC2-Windows\Launch\Module\Scripts\Add-Routes.ps1`. This function adds the network routes required to reach various services via the reserved IP addresses in `169.254.169.0/24`. To do so, it adds routes telling the operating system to route all of these packets through the default gateway of a primary network interface. The `Add-Routes.ps1` script determines the primary network interface by performing the following query on line 85:

```powershell
$networkAdapter = Get-CimInstance -ClassName Win32_NetworkAdapter -Filter "AdapterTypeId='0' AND NetEnabled='True' AND NOT Name LIKE '%Hyper-V Virtual%' AND NOT Name LIKE '%Loopback Adapter%' AND NOT Name LIKE '%TAP-Windows Adapter%'" | Sort-Object -Property "InterfaceIndex" | Select-Object InterfaceIndex
```

and selecting the interface with the lowest value of `InterfaceIndex`. Notably, this query **does not** filter out adapters like "VMware Virtual Ethernet Adapter for VMnet 1", which is created whenever VMware Workstation is installed. When I experimented, I found that the aforementioned virtual Ethernet adapter had a lower `InterfaceIndex` than the primary interface "AWS PV Network Device #0". Since the virtual adapter does not have a default gateway, reading that property yields `$null`, which throws the exception above and prevents the script from adding the correct routes for the AWS services, rendering SSM inaccessible for that instance.

# Solution...?

Either don't install VMware, or change the indices of the virtual network adapters. This solves my specific use case, but doesn't solve the future problem of some other virtual adapter with a lower index slipping past that query. Ideally, the query used to find a primary network interface for the default gateway should be made more robust. Rather than filtering out known bad values, a better approach might be to filter for known good values, eliminating the need to enumerate every possible virtual network adapter, which could have arbitrary names.
0 answers | 0 votes | 1 view
AWS-User-4917520, asked 18 days ago

Error when running vsock_sample AWS Nitro tutorial

I have configured and built the enclave instance as per https://docs.aws.amazon.com/enclaves/latest/user/enclaves-user.pdf, but when I try to run it, it throws the following error:

```
$ nitro-cli run-enclave --eif-path vsock_sample.eif --cpu-count 2 --enclave-cid 6 --memory 512 --debug-mode
Start allocating memory...
Started enclave with enclave-cid: 6, memory: 512 MiB, cpu-ids: [1, 5]
[ E36 ] Enclave boot failure. Such error appears when attempting to receive the `ready` signal from a freshly booted enclave. It arises in several contexts, for instance, when the enclave is booted from an invalid EIF file and the enclave process immediately exits, failing to submit the `ready` signal. In this case, the error backtrace provides detailed information on what specifically failed during the enclave boot process.

For more details, please visit https://docs.aws.amazon.com/enclaves/latest/user/cli-errors.html#E36

If you open a support ticket, please provide the error log found at "/var/log/nitro_enclaves/err2022-04-27T03:41:39.495653281+00:00.log"

Failed connections: 1

[ E39 ] Enclave process connection failure. Such error appears when the enclave manager fails to connect to at least one enclave process for retrieving the description information.

For more details, please visit https://docs.aws.amazon.com/enclaves/latest/user/cli-errors.html#E39

If you open a support ticket, please provide the error log found at "/var/log/nitro_enclaves/err2022-04-27T03:41:39.495889864+00:00.log"

Action: Run Enclave
Subactions:
    Failed to handle all enclave process replies
    Failed to connect to 1 enclave processes
Root error file: src/enclave_proc_comm.rs
Root error line: 349
Build commit: not available
```

How do I fix this error?
0 answers | 0 votes | 0 views
c02f2e, asked 19 days ago

How to use psycopg2 to load data into Redshift tables with the COPY command

I am trying to load data from an EC2 instance into Redshift tables but cannot figure out how to do this using the COPY command. I have tried the following to create the SQL queries:

```
def copy_query_creator(table_name, schema):
    copy_sql_template = sql.SQL(
        "COPY {table_name} from stdin iam_role 'iam_role' DATEFORMAT 'MM-DD-YYYY' "
        "TIMEFORMAT 'MM-DD-YYYY HH12:MI:SS AM' ACCEPTINVCHARS fixedwidth {schema}"
    ).format(table_name=sql.Identifier(table_name), schema=schema)
    return copy_sql_template
```

and

```
def copy_query_creator_2(table_name, iam_role, schema):
    copy_sql_base = """
    COPY {}
    FROM STDIN
    iam_role {}
    DATEFORMAT 'MM-DD-YYYY'
    TIMEFORMAT 'MM-DD-YYYY HH12:MI:SS AM'
    ACCEPTINVCHARS
    fixedwidth {}""".format(table_name, iam_role, schema)
    print(copy_sql_base)
    return copy_sql_base
```

where schema is the fixedwidth_spec in the example snippet below:

```
copy table_name
from 's3://mybucket/prefix'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'fixedwidth_spec';
```

The function that uses the created query looks like this:

```
def copy_query(self, filepath):
    schema = Query.create_schema()      # returns the formatted fixedwidth_spec
    table_name = Query.get_table_def()  # returns the table_name
    print(copy_query_creator_2(table_name, iam_role, schema))
    self.connect()
    with self.connection.cursor() as cursor:
        try:
            with open(filepath) as f:
                cursor.copy_expert(copy_query_creator_2(table_name, iam_role, schema), f)
            print('copy worked')
            logging.info(f'{copy_query_creator_2(table_name, iam_role, schema)} ran; {cursor.rowcount} records copied.')
        except (Exception, psycopg2.Error) as error:
            logging.error(error)
            print(error)
```

The two attempts return errors. The first returns 'Composed elements must be Composable, got %r instead', while the latter returns 'error at or near STDIN'. Please help.
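Regarding the first error, an editorial note (not from the asker): `psycopg2.sql.SQL.format()` accepts only `Composable` arguments (`Identifier`, `Literal`, `SQL`, `Composed`), so the plain string `schema` has to be wrapped. A minimal sketch, with the S3 path as a placeholder:

```python
from psycopg2 import sql

def copy_query_creator(table_name, iam_role, fixedwidth_spec):
    # Plain Python strings passed to sql.SQL(...).format() raise
    # "Composed elements must be Composable"; wrap them first.
    return sql.SQL(
        "COPY {table} FROM 's3://mybucket/prefix' "  # placeholder S3 path
        "iam_role {role} "
        "DATEFORMAT 'MM-DD-YYYY' "
        "TIMEFORMAT 'MM-DD-YYYY HH12:MI:SS AM' "
        "ACCEPTINVCHARS fixedwidth {spec}"
    ).format(
        table=sql.Identifier(table_name),
        role=sql.Literal(iam_role),
        spec=sql.Literal(fixedwidth_spec),
    )
```

The resulting `Composed` object can be passed directly to `cursor.execute()`, or rendered with `.as_string(connection)` for logging. As for the second error: as far as I know, Redshift's COPY loads only from sources such as S3 and does not implement the PostgreSQL `COPY ... FROM STDIN` protocol that `copy_expert()` relies on, which is consistent with the "error at or near STDIN" message; that is why the sketch above copies from S3 rather than from a local file.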
0 answers | 0 votes | 3 views
AWS-User-8813229, asked 20 days ago

Failed to start an EC2 task on ECS

Hi there, I am trying to start a task that uses a GPU on my instance. The EC2 instance is already added to a cluster, but the task failed to start. Here is the error:

```
status: STOPPED (CannotStartContainerError: Error response from dae)

Details
Status reason: CannotStartContainerError: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr
Network bindings - not configured
```

EC2 setup:

```
Type: AWS::EC2::Instance
Properties:
  IamInstanceProfile: !Ref InstanceProfile
  ImageId: ami-0d5564ca7e0b414a9
  InstanceType: g4dn.xlarge
  KeyName: tmp-key
  SubnetId: !Ref PrivateSubnetOne
  SecurityGroupIds:
    - !Ref ContainerSecurityGroup
  UserData:
    Fn::Base64: !Sub |
      #!/bin/bash
      echo ECS_CLUSTER=traffic-data-cluster >> /etc/ecs/ecs.config
      echo ECS_ENABLED_GPU_SUPPORT=true >> /etc/ecs/ecs.config
```

Dockerfile:

```
FROM nvidia/cuda:11.6.0-base-ubuntu20.04

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

# RUN nvidia-smi
RUN echo 'install pip packages'
RUN apt-get update
RUN apt-get install python3.8 -y
RUN apt-get install python3-pip -y
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 --version
RUN python --version

WORKDIR /
COPY deployment/video-blurring/requirements.txt /requirements.txt
RUN pip3 install --upgrade pip
RUN pip3 install --user -r /requirements.txt

## Set up the requisite environment variables that will be passed during the build stage
ARG SERVER_ID
ARG SERVERLESS_STAGE
ARG SERVERLESS_REGION

ENV SERVER_ID=$SERVER_ID
ENV SERVERLESS_STAGE=$SERVERLESS_STAGE
ENV SERVERLESS_REGION=$SERVERLESS_REGION

COPY config/env-vars .

## Sets up the entry point for running the bashrc which contains environment variables and
## triggers the python task handler
COPY script/*.sh /
RUN ["chmod", "+x", "./initialise_task.sh"]

## Copy the code to /var/runtime - following the AWS Lambda convention
## Use ADD to preserve the underlying directory structure
ADD src /var/runtime/

ENTRYPOINT ./initialise_task.sh
```
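One thing worth checking, as an editorial sketch rather than anything from the original post: for the ECS agent to reserve a GPU and for the NVIDIA runtime hook to have a device to attach, the container definition must declare a GPU resource requirement. A minimal boto3 sketch with placeholder names:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-2")  # placeholder region

# Placeholder family, image, and sizes -- adapt to the real task.
ecs.register_task_definition(
    family="video-blurring-gpu",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "video-blurring",
            "image": "<account>.dkr.ecr.us-east-2.amazonaws.com/video-blurring:latest",
            "memory": 4096,
            "essential": True,
            # Without this, no GPU is reserved for the container and the
            # NVIDIA hook can fail at container start.
            "resourceRequirements": [
                {"type": "GPU", "value": "1"},
            ],
        }
    ],
)
```

If the GPU requirement is already declared, the same `Running hook #0` failure can also come from a host-driver/CUDA mismatch with the `nvidia/cuda:11.6.0` base image; running `nvidia-smi` on the host (as the commented-out Dockerfile line does at build time) is a quick way to check the driver version.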
0 answers | 0 votes | 2 views
AWS-User-6797102, asked 20 days ago

Instances can't reach classic ELB in VPC after ENI change

Four or five times in the past 6-8 weeks, we've had situations where one of our EC2 instances (running CentOS) cannot reach the private IP address of a Classic ELB. I believe this is due to scaling events (or something else causing replacement of ELB components) happening on the ELB. From what I see in CloudTrail, the network interface is replaced with one having the same IP address but a different MAC address. Sometimes, but not always, the old MAC address gets stuck in the instance's ARP cache (in REACHABLE state), preventing the instance from communicating with the ELB and causing drastic issues for our application. If I manually delete the entry from the ARP cache, things start working again.

This is happening across different environments, so multiple subnets, multiple ELBs, and multiple EC2 instances. These environments and components had been running for years without seeing this issue before. The only network config change we've made recently is disabling jumbo frames earlier this year, but I don't see how that would impact this. Any ideas how to fix this? Thanks.

EDIT: This happened again today and I was able to examine things more closely. The new ENI is actually re-using an IP address that had last been used over a month prior. The old entry for said IP address is still listed in the ARP cache with the prior MAC address, despite not being used for about four weeks. This explains why it's starting to happen more frequently: the chance that an IP address gets re-used increases as new ENIs are created for the ELBs. It's a /26 subnet, so there aren't a lot of addresses to choose from.
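As an editorial stopgap sketch, not from the thread: while chasing the root cause, a small watchdog can detect when the ELB's private IP stops answering and flush its neighbor-cache entry, automating the manual fix described above. The IP, port, and interface are placeholder assumptions, and this assumes a Linux host with the `ip` tool.

```python
import socket
import subprocess

ELB_IP = "10.0.0.42"  # placeholder: the ELB's private IP
PORT = 443            # placeholder: a port the ELB listens on
DEV = "eth0"          # placeholder: the instance's primary interface

def elb_reachable(timeout=3):
    """Attempt a plain TCP connection to the ELB."""
    try:
        with socket.create_connection((ELB_IP, PORT), timeout=timeout):
            return True
    except OSError:
        return False

if not elb_reachable():
    # Drop the (possibly stale) ARP entry; the kernel re-resolves the MAC
    # address on the next packet sent to the ELB.
    subprocess.run(["ip", "neigh", "flush", "to", ELB_IP, "dev", DEV], check=True)
```

Run from cron or a systemd timer, this only masks the symptom; the underlying question of why a REACHABLE entry survives ENI replacement still stands.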
1 answer | 0 votes | 2 views
AWS-User-4421521, asked 21 days ago

Horizontal Scaling concerns, SSL issue with NLB

note: I'm new to scaling and am firstly seeking advice on the best practices for horizontal scaling.

**I have the following setup:**

*EC2 instances <-> ASG (created from a launch template) -> TG <-> ALB <-> TG <-> NLB*

Traffic flows through the NLB to the ALB and finally to the EC2 instances configured via the ASG.

note: I'm assuming the above setup is the best one to go with for horizontal scaling; if not, please let me know.

The above setup works fine for HTTP, whereas when I try to configure HTTPS, I don't see options to do so.

**Issue 1:** The target group (TG) doesn't allow me to create one with load balancer type and TLS port 443; it allows only TCP port 80.

**Question 1:** How else should I redirect HTTPS traffic to the ALB? note: I need the NLB because the ALB doesn't provide static IPs.

**Question 2:** Regarding static IPs: the NLB doesn't allow <2 AZs, which means I need to have 2 static IPs linked to my domain?

Any inputs would be really helpful!

**Update 1:** I've configured it like below.

In the ALB listeners:

- HTTP (80) gets redirected to HTTPS
- HTTPS (443) gets forwarded to the ASG

In the NLB listeners:

- HTTP (80) gets forwarded to the ALB

note: The ALB's public URL is added to my domain (sample-alb.domain.com), and the NLB's public URL is added to my domain (sample-nlb.domain.com).

SSL works fine if the user enters via sample-alb.domain.com, whereas if the user enters via sample-nlb.domain.com, it always fails with "ERR_CERT_INVALID". Any inputs on why this fails?

**Update 2:** I've got the answer to my Issue 1/Question 1 on how to redirect HTTPS traffic to the ALB from here: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/application-load-balancer-target.html#configure-application-load-balancer-target

> **Listeners and routing**
> For Listeners, the default is a listener that accepts TCP traffic on port 80. Only TCP listeners can forward traffic to an Application Load Balancer target group. Keep the listener protocol set to TCP, but you can modify the port as required.
>
> This setup allows you to use HTTPS listeners on the Application Load Balancer to terminate the TLS protocol.

So I created a TG with TCP port 80 and a listener on the NLB, which redirects to the ALB (say, for example, my NLB's public URL is 'nlb34323.amazonaws.com'). Now, when I hit my NLB's public URL with 'http://nlb34323.amazonaws.com', it does get redirected to 'https://nlb34323.amazonaws.com' but eventually fails with a timeout error.

note: whereas when I hit the ALB's public URL, it works fine.

Does it have anything to do with TLS termination, as mentioned in the above documentation?

> This setup allows you to use HTTPS listeners on the Application Load Balancer to terminate the TLS protocol.

What am I doing wrong here?
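An editorial sketch of what the timeout in Update 2 is consistent with, not something from the thread: after the redirect to `https://nlb34323.amazonaws.com`, the browser opens a TLS connection to port 443 on the NLB, and if the NLB only has a TCP:80 listener, nothing answers there. One way to close the gap is a second TCP listener on 443 that forwards to an ALB-type target group on port 443, leaving TLS termination to the ALB. The ARNs below are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # placeholder region

# Placeholder ARNs -- substitute the real NLB and the ALB-type target
# group whose port is 443 (the ALB's HTTPS listener port).
NLB_ARN = "arn:aws:elasticloadbalancing:region:account:loadbalancer/net/my-nlb/abc"
ALB_TG_443 = "arn:aws:elasticloadbalancing:region:account:targetgroup/alb-tg-443/def"

# TCP passthrough on 443: the NLB forwards the raw TLS stream and the
# ALB terminates it.
elbv2.create_listener(
    LoadBalancerArn=NLB_ARN,
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": ALB_TG_443}],
)
```

Even with that listener, "ERR_CERT_INVALID" on sample-nlb.domain.com would persist unless the certificate on the ALB's HTTPS listener also covers that hostname, since the NLB passes the TLS stream through untouched.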
2 answers | 0 votes | 6 views
Saru, asked 23 days ago