ECS agent sporadically times out while fetching secrets from SSM Parameter Store

We have an ECS cluster in us-west-2 that runs a few ECS services, plus some ECS tasks that are invoked periodically via EventBridge. All tasks use the EC2 launch type and run on container instances that we manage with an Auto Scaling Group. The AMI currently in use is amzn2-ami-ecs-hvm-2.0.20220630-x86_64-ebs. Container instances are launched in private subnets, and VPC endpoints are set up for a few AWS services, including SSM.

A few months ago we started seeing missed check-ins from the periodically launched tasks, and found that at least some of them had failed to launch because a request to the SSM API endpoint timed out.

In the ECS agent's log, the failure shows up like this:

level=error time=2022-09-19T22:30:56Z msg="Failed to create task resource" error="fetching secret data from SSM Parameter Store in us-west-2: RequestError: send request failed\ncaused by: Post "https://ssm.us-west-2.amazonaws.com/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" task="..." resource="ssmsecret"
level=info time=2022-09-19T22:30:56Z msg="Setting terminal reason for task" reason="fetching secret data from SSM Parameter Store in us-west-2: Request Error: send request failed\ncaused by: Post "https://ssm.us-west-2.amazonaws.com/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" task="..."

We tried increasing the throughput of SSM Parameter Store through its settings, but it had no apparent effect. https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-throughput.html

The other guides and Q&As I could find were about network misconfigurations that lead to a complete inability to talk to SSM, whereas the symptom I'm seeing is only intermittent: the ECS tasks launch without an issue most of the time. https://aws.amazon.com/premiumsupport/knowledge-center/ssm-tcp-timeout-error/

What could be the cause? What else can I look into?

  • Did you ever find a solution to this? I have a similar issue but from a Lambda instead of ECS.

  • Unfortunately no. One thing I was thinking of trying was to use something like chamber [1] to fetch params on the application side, so that we have more control over retries, but I haven't gotten around to it. (Partly because those failures seem to have become somewhat less frequent on their own.)

    [1] https://github.com/segmentio/chamber
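
    To illustrate the "fetch on the application side with our own retries" idea, here is a minimal sketch in Python. It is not how chamber works, just the same general approach. The names fetch_with_retries and make_ssm_fetcher are made up for this example, and the timeouts, attempt counts and the boto3 wiring are assumptions to adapt.

```python
import random
import time


def fetch_with_retries(fetch, name, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call fetch(name), retrying with capped exponential backoff and full jitter.

    `fetch` is any callable that returns the parameter value or raises.
    `sleep` is injectable so the helper can be exercised without real delays.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(name)
        except retry_on as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # out of attempts, re-raise below
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise last_error


def make_ssm_fetcher(region="us-west-2"):
    """Build a fetch(name) callable backed by boto3 (hypothetical wiring).

    boto3 is imported lazily so the retry helper above has no hard dependency on it.
    """
    import boto3
    from botocore.config import Config

    client = boto3.client(
        "ssm",
        region_name=region,
        # Assumed values: short timeouts, and SDK retries disabled so the
        # outer loop is the only retry layer.
        config=Config(connect_timeout=5, read_timeout=5,
                      retries={"total_max_attempts": 1}),
    )

    def fetch(name):
        resp = client.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]

    return fetch
```

    In the application's startup code this would be called as something like fetch_with_retries(make_ssm_fetcher(), "/my/app/secret"), where the parameter name is a placeholder. The trade-off is that the task role, rather than the task execution role, then needs permission to read the parameter.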

  • In my case, one of the 2 network interfaces attached to my VPC didn't have an IP address allocated to it. It seems that connections to SSM/SecretsManager alternated between the 2 network interfaces, and whenever one went through the interface without an IP, the request was not answered and simply timed out.
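
    For anyone who wants to check for the same condition, here is a rough sketch in Python that lists network interfaces with no private IP address. It assumes the interfaces in question belong to an interface VPC endpoint (as in the original question's setup) and that boto3 is available. The function names are made up for this example, and the endpoint ID is something you would have to supply.

```python
def enis_without_private_ip(network_interfaces):
    """Return the IDs of ENIs that have no private IPv4 address.

    `network_interfaces` is the "NetworkInterfaces" list from the EC2
    DescribeNetworkInterfaces response.
    """
    missing = []
    for eni in network_interfaces:
        has_primary = bool(eni.get("PrivateIpAddress"))
        has_any = any(addr.get("PrivateIpAddress")
                      for addr in eni.get("PrivateIpAddresses", []))
        if not (has_primary or has_any):
            missing.append(eni.get("NetworkInterfaceId", "<unknown>"))
    return missing


def describe_endpoint_enis(vpc_endpoint_id, region="us-west-2"):
    """Fetch the ENIs behind one interface VPC endpoint (hypothetical helper)."""
    import boto3  # imported lazily; only needed for the live lookup

    ec2 = boto3.client("ec2", region_name=region)
    endpoints = ec2.describe_vpc_endpoints(
        VpcEndpointIds=[vpc_endpoint_id])["VpcEndpoints"]
    eni_ids = [eni_id
               for endpoint in endpoints
               for eni_id in endpoint.get("NetworkInterfaceIds", [])]
    if not eni_ids:
        return []
    return ec2.describe_network_interfaces(
        NetworkInterfaceIds=eni_ids)["NetworkInterfaces"]
```

    Running enis_without_private_ip(describe_endpoint_enis(endpoint_id)) against the SSM endpoint should return an empty list in a healthy setup. A non-empty result would at least be worth a closer look, though I can't say it explains every intermittent timeout.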

Asked 2 years ago. Viewed 112 times.
No answers