I use AWS Batch on top of ECS.
Today, I am seeing many programs that use the AWS API randomly fail (roughly 70% of the time) due to timeouts when connecting to AWS endpoints (e.g. S3, ECR, Secrets Manager):
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: http://xxxxx
subprocess.CalledProcessError: Command '['aws', 'ecr', 'get-login-password', '--region', 'us-west-1']' returned non-zero exit status 255
Even worse, some Batch jobs fail to start because they cannot connect to CloudWatch at the very beginning of the deployment:
CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: CredentialsEndpointError: failed to load credentials caused by: RequestError: send request failed caused by: Get "http://16
As far as I can tell from the AWS Health Dashboard, everything is green...
Does anyone know what we should check to address this issue?
UPDATED:
I found that the EC2 instances that fail to start AWS Batch jobs have no access to 169.254.170.2.
(i.e. on a failing instance, curl http://localhost:51679/ returns a response, but curl http://169.254.170.2/ never returns any response. On a working instance, both return responses.)
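For reference, the checks I ran were essentially these (the connect timeout is only added here for illustration):

$ curl --connect-timeout 5 http://localhost:51679/    # returns a response on both instances
$ curl --connect-timeout 5 http://169.254.170.2/      # times out on the failing instance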
I compared the iptables rules of the two instances (one that successfully starts a job and one that fails to start a job), but could not find any difference.
(I see only the redirect rules described in the README: https://github.com/aws/amazon-ecs-agent/blob/master/README.md )
and net.ipv4.conf.all.route_localnet=1 is set on both instances.
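(For reference, I confirmed that value like this; the output below is what I see on both instances:)

$ sysctl net.ipv4.conf.all.route_localnet
net.ipv4.conf.all.route_localnet = 1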
I checked iptables, but no packets ever reached the rule in the nat OUTPUT chain that redirects them to port 51679 (the packet counters stay at 0).
(I do not manipulate any iptables rules or policies myself; they are used as-is, as configured by the stock AMI.)
$ sudo iptables -t nat -L -v -n
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
0 0 DNAT tcp -- * * 0.0.0.0/0 169.254.170.2 tcp dpt:80 to:127.0.0.1:51679
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
0 0 REDIRECT tcp -- * * 0.0.0.0/0 169.254.170.2 tcp dpt:80 redir ports 51679
I hooked TRACE rules onto these packets:
sudo iptables -t raw -A PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j TRACE
sudo iptables -t raw -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j TRACE
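(The trace output goes to the kernel log; depending on the AMI it can be followed via journalctl, dmesg, or /var/log/messages. I simply grepped for "TRACE:", e.g.:)

$ sudo journalctl -k -f | grep 'TRACE:'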
And I found that the log messages contain TRACE: filter:OUTPUT:policy entries, but no TRACE: filter:OUTPUT:rule entries:
Jul 19 19:48:46 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: raw:OUTPUT:policy:3 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13131 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBF9C99000000000103030E) UID=1000 GID=1000
Jul 19 19:48:46 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13131 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBF9C99000000000103030E) UID=1000 GID=1000
Jul 19 19:48:47 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: raw:OUTPUT:policy:3 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13132 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBFA0D1000000000103030E) UID=1000 GID=1000
Jul 19 19:48:47 ip-172-25-2-70.lax.internal.gitai.tech kernel: TRACE: filter:OUTPUT:policy:1 IN= OUT=eth0 SRC=172.25.2.70 DST=169.254.170.2 LEN=60 TOS=0x00 PREC=0x00 TTL=255 ID=13132 DF PROTO=TCP SPT=34784 DPT=80 SEQ=1439278992 ACK=0 WINDOW=35844 RES=0x00 SYN URGP=0 OPT (020423010402080A6DBFA0D1000000000103030E) UID=1000 GID=1000
I still have no idea why packets to 169.254.170.2 are dropped on the failing instances...
Thank you for the comment!
Yes, we have.
No.
Could you please elaborate on what kind of rules we should set up?
Until today, we had no connectivity issues from the Batch job instances to the AWS services; the problem started appearing randomly today. We have not changed any VPC settings recently.
If you have been using the system without any problems and have not changed any network settings, then the checkpoints I shared should not be the problem.
I found an issue similar to yours on GitHub. It seems the error can be avoided by setting "ECS_TASK_METADATA_RPS_LIMIT" in the ECS container agent configuration. Note that "ECS_TASK_METADATA_RPS_LIMIT" can only be changed for the EC2 launch type. https://github.com/aws/amazon-ecs-agent/issues/1262
Please refer to the following document for container agent configuration. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
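For example, on an EC2 container instance it could look something like the following (the values are only illustrative, and the restart command depends on the AMI; on the Amazon Linux 2 ECS-optimized AMI the agent runs as a systemd service):

# Append the setting to the agent configuration (example values, not a recommendation)
echo 'ECS_TASK_METADATA_RPS_LIMIT=100,150' | sudo tee -a /etc/ecs/ecs.config

# Restart the container agent so it picks up the new value
sudo systemctl restart ecs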
If you are using Fargate, implementing a credential cache or similar may alleviate throttling errors to the endpoint. https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/guide_configuration.html#config_credentials
Thank you for the additional comments! As described in my updated post, I confirmed that the iptables redirection for the ECS agent does not work on the failing instances. I still have no idea why this happens, though.
One possible cause is an increased number of AWS API calls, which in turn increases accesses to the metadata endpoints and may cause something similar to a throttling error. So you may want to adjust the container agent's "ECS_TASK_METADATA_RPS_LIMIT" as I shared.
Thank you for the comment.
If that were the case, I would not be able to access the metadata from either http://localhost:51679 or http://169.254.170.2/. However, I can still access it from http://localhost:51679. I think the cause is that any packet destined for 169.254.170.2 is dropped and never reaches the ECS agent.
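(One way I am thinking of confirming this, assuming tcpdump is available on the instance, is to watch whether any traffic to 169.254.170.2 ever shows up redirected onto the agent's local port:)

$ sudo tcpdump -n -i any 'host 169.254.170.2 or tcp port 51679'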