Various AWS APIs fail due to timeout


I use AWS Batch on top of ECS. Today, lots of programs that call the AWS API are randomly failing (roughly 70% of the time) with timeouts when connecting to AWS endpoints (e.g. S3, ECR, Secrets Manager):

botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: http://xxxxx
subprocess.CalledProcessError: Command '['aws', 'ecr', 'get-login-password', '--region', 'us-west-1']' returned non-zero exit status 255

Even worse, some Batch jobs fail to start because they cannot connect to CloudWatch at the very beginning of the deployment:

CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: CredentialsEndpointError: failed to load credentials caused by: RequestError: send request failed caused by: Get "http://16

As far as I can tell from the AWS Health Dashboard, everything is green... Does anyone know what we should check to address this issue?

UPDATED:

I found that the EC2 instances that fail to start AWS Batch jobs have no access to 169.254.170.2. (On a failing instance, curl http://localhost:51679/ returns a response, but curl http://169.254.170.2/ never returns anything. On a working instance, both return responses.) I compared the iptables rules of the two instances (one that successfully starts a job and one that fails to) but could not find any difference; I only see the redirect rules described in the README (https://github.com/aws/amazon-ecs-agent/blob/master/README.md), and net.ipv4.conf.all.route_localnet=1 is set on both instances.
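
For reference, the checks looked roughly like this (port 51679 is where the ECS agent serves credentials locally; the timeout value is just an example):

# Works on both instances: the ECS agent's local credential endpoint
curl -sv http://localhost:51679/
# Times out on the failing instance: this address should be redirected to 127.0.0.1:51679 by iptables
curl -sv --max-time 5 http://169.254.170.2/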

I checked iptables, but no packets reached the OUTPUT rule that redirects them to port 51679. (I don't modify any iptables rules or policies myself; I use them as configured by the stock AMI.)

$ sudo iptables -t nat -L -v -n
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            169.254.170.2        tcp dpt:80 to:127.0.0.1:51679

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 REDIRECT   tcp  --  *      *       0.0.0.0/0            169.254.170.2        tcp dpt:80 redir ports 51679

I attached TRACE rules to these packets:

sudo iptables -t raw -A PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j TRACE
sudo iptables -t raw -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j TRACE
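
(For anyone reproducing this: the TRACE output goes to the kernel log, e.g. dmesg or /var/log/messages on Amazon Linux, and the rules can be removed again with -D so the log is not flooded. The commands below are a sketch, not part of the original debugging session.)

# Follow the trace output as it arrives in the kernel log
sudo dmesg -wT | grep 'TRACE:'

# Remove the TRACE rules afterwards
sudo iptables -t raw -D PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j TRACE
sudo iptables -t raw -D OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j TRACE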

I found TRACE: filter:OUTPUT:policy entries but no TRACE: filter:OUTPUT:rule entries in the log messages:

Jul 19 19:48:46 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: raw:OUTPUT:policy:3 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13131 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBF9C99000000000103030E) UID=1000 GID=1000 
Jul 19 19:48:46 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13131 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBF9C99000000000103030E) UID=1000 GID=1000 
Jul 19 19:48:47 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: raw:OUTPUT:policy:3 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13132 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBFA0D1000000000103030E) UID=1000 GID=1000 
Jul 19 19:48:47 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13132 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBFA0D1000000000103030E) UID=1000 GID=1000 

I still have no idea why packets to 169.254.170.2 are dropped on the failing instances...

2 Answers
Accepted Answer

I finally identified the cause and found a solution.

The problem is cross-contamination between the two iptables variants (legacy and nftables).

This occurs when the container image used with AWS Batch ships an iptables that uses nftables as its backend by default, as Ubuntu 22.04 does.

Our container image used in AWS Batch was configured to start Docker-in-Docker on startup, and the --privileged option was enabled to make this work.

The Docker daemon uses iptables internally; it loads nf_tables into the host OS kernel and breaks the legacy iptables settings, such as the port redirection used by the AWS ECS agent.

The clue was the messages below, left in the kernel log only on the instance that did not work:

[ 181.758769] nf_tables: (c) 2007-2009 Patrick McHardy <kaber@trash.net>
[ 182.027143] nf_tables_compat: (c) 2012 Pablo Neira Ayuso <pablo@netfilter.org>
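
A quick way to see which variant is in effect (illustrative commands; the update-alternatives paths assume a Debian/Ubuntu-based image):

# On the host: loaded nf_tables modules hint that something has injected nft rules
lsmod | grep nf_tables

# Inside the container image: which backend does iptables point to?
update-alternatives --display iptables
iptables --version    # prints "(nf_tables)" or "(legacy)" after the version number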

The problem was solved by configuring the image to use iptables-legacy, the same backend Amazon Linux uses.

# Dockerfile
# Switch iptables and ip6tables to the legacy backend so Docker-in-Docker
# does not pull nftables into the host kernel and break the ECS agent's redirect rules
RUN update-alternatives --set iptables /usr/sbin/iptables-legacy
RUN update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
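
As a sanity check after rebuilding (my-image is just a placeholder name here), iptables inside the image should now report the legacy backend:

docker run --rm my-image iptables --version
# expected to print something like "iptables v1.8.x (legacy)" rather than "(nf_tables)"
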
yfgitai
answered 9 months ago

Is there a route set up in the route table for the subnet where the AWS Batch task is launched to go out to the public network?
If there is no public network, do you have VPC endpoints set up?
Also, have you set up the necessary outbound and inbound rules for network ACLs and security groups?
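
These can be checked quickly with the AWS CLI, for example (the subnet and security group IDs below are placeholders):

# Route table associated with the subnet used by the Batch compute environment
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-xxxxxxxx

# Inbound/outbound rules of the security group attached to the instances
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx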

The error at the bottom appears to be a failure to obtain credentials.
I suspect you are experiencing the same problem as the one described here:
https://github.com/aws/aws-cdk/issues/5954

EXPERT
answered 9 months ago
  • Thank you for the comment!

    Is there a route set up in the route table for the subnet where the AWS Batch task is launched to go out to the public network?

    Yes, we have.

    If there is no public network, do you have VPC endpoints set up?

    No.

    have you set up the necessary outbound and inbound rules for network ACLs and security groups?

    Can you please elaborate on what kind of rules we should set up?

    Until today, we had no issues with connectivity to AWS services from the Batch job instances; now we see the issue randomly. We have not changed any VPC settings recently.

  • If you have been using the system without any problems and have not changed any settings around the network, then the points I listed to check are probably not the problem.
    I found a similar issue to yours on a GitHub issue. It seems that the error can be avoided by setting "ECS_TASK_METADATA_RPS_LIMIT" in the ECS container agent. Note that "ECS_TASK_METADATA_RPS_LIMIT" can only be changed for EC2 boot type. https://github.com/aws/amazon-ecs-agent/issues/1262
    Please refer to the following document for container agent configuration. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
    If you are using Fargate, implementing a credential cache or similar may alleviate throttling errors to the endpoint. https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/guide_configuration.html#config_credentials
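
    For the EC2 launch type, that setting goes into the agent configuration file on the container instance; a rough example (the values here are only illustrative, not a recommendation):

    # /etc/ecs/ecs.config
    # steady-state,burst requests per second allowed against the task metadata endpoint
    ECS_TASK_METADATA_RPS_LIMIT=100,150
    # then restart the agent, e.g. sudo systemctl restart ecs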

  • Thank you for the additional comments! As described in the update to my post, I confirmed that the iptables redirect for the ECS agent does not work on the failing instances. I still have no idea why this happens, though.

  • One possible cause is an increase in calls to the AWS API, which increases requests to the metadata endpoints and may trigger something similar to a throttling error. So you may want to adjust the "ECS_TASK_METADATA_RPS_LIMIT" of the container agent as I shared.

  • Thank you for the comment.

    So, you may want to adjust the "ECS_TASK_METADATA_RPS_LIMIT" of the container agent as I shared.

    If that were the case, I would not be able to access the metadata from either http://localhost:51679 or http://169.254.170.2/. However, I can still access it from http://localhost:51679. I think the cause is that packets destined for 169.254.170.2 are dropped and never reach the ECS agent.
