ECS agent sporadically times out while fetching secrets from SSM Parameter Store


We have an ECS cluster in us-west-2 that runs a few ECS services. We also run some ECS tasks that are invoked periodically via EventBridge. All tasks use the EC2 launch type and run on container instances that we manage with an Auto Scaling Group. The AMI currently in use is amzn2-ami-ecs-hvm-2.0.20220630-x86_64-ebs. Container instances are launched in private subnets, and VPC endpoints are set up for a few AWS services, including SSM.

A few months ago we started seeing missed check-ins from the periodically launched tasks, and found that at least some of them failed to launch because of a timeout from the SSM API endpoint.

In the ECS agent's log, the failure shows up like this:

level=error time=2022-09-19T22:30:56Z msg="Failed to create task resource" error="fetching secret data from SSM Parameter Store in us-west-2: RequestError: send request failed\ncaused by: Post "https://ssm.us-west-2.amazonaws.com/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" task="..." resource="ssmsecret"
level=info time=2022-09-19T22:30:56Z msg="Setting terminal reason for task" reason="fetching secret data from SSM Parameter Store in us-west-2: Request Error: send request failed\ncaused by: Post "https://ssm.us-west-2.amazonaws.com/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" task="..."

We tried increasing the throughput of SSM Parameter Store through its settings, but it didn't seem to have an effect. https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-throughput.html

Other guides and Q&As I could find were about network misconfigurations that would lead to a complete inability to talk to SSM, whereas the symptom I'm seeing is only intermittent; the ECS tasks get launched without an issue most of the time. https://aws.amazon.com/premiumsupport/knowledge-center/ssm-tcp-timeout-error/

What could be the cause? What else can I look into?
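
One thing that can be checked from an affected container instance is whether every IP address that the SSM hostname resolves to actually accepts TCP connections on port 443. With a VPC interface endpoint and private DNS, the hostname normally resolves to the endpoint's network interface addresses, so an intermittent "waiting for connection" timeout could come from only one of those addresses being unreachable. That is only a hypothesis, not a confirmed cause. Below is a minimal sketch, assuming Python 3 is available on the instance; the function names `resolve_ips`, `probe` and `probe_all` are made up for illustration.

```python
import socket
import time


def resolve_ips(host, port=443):
    """Return the distinct IPv4 addresses that `host` currently resolves to."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(ip, port=443, timeout=3.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and unreachable hosts
        return False, time.monotonic() - start, str(exc)


def probe_all(host, port=443, timeout=3.0, rounds=5):
    """Probe every resolved address `rounds` times; count failures per IP."""
    failures = {}
    for _ in range(rounds):
        for ip in resolve_ips(host, port):
            ok, _elapsed, _err = probe(ip, port, timeout)
            failures.setdefault(ip, 0)
            if not ok:
                failures[ip] += 1
    return failures


# Example (run on a container instance; the hostname is the one from the log):
#   for ip, failed in probe_all("ssm.us-west-2.amazonaws.com").items():
#       print(f"{ip}: {failed} failed connects")
```

If one address fails consistently while the others succeed, the network interface behind that address (its subnet, security group and IP assignment) is the place to look next. This only tests TCP reachability, not TLS or the SSM API itself.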

  • Did you ever find a solution to this? I have a similar issue but from a Lambda instead of ECS.

  • Unfortunately no. One of the things I was thinking of trying was to use something like chamber [1] to fetch params on the application side so we have more control over retries, but I haven't gotten around to it. (Partly because the frequency of those failures seemed to have gotten somewhat lower on its own.)

    [1] https://github.com/segmentio/chamber

  • In my case, one of the 2 network interfaces attached to my VPC didn't have an IP address allocated to it. It seems the VPC was alternating between the 2 network interfaces when connecting to SSM/SecretsManager, and whenever a request went through the one without an IP address, the Amazon service would not respond and the request timed out.
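
For anyone who wants to try the idea from the comments of fetching parameters on the application side with control over retries, here is a rough sketch of what that could look like. It is not chamber and not what the ECS agent does; it is a plain Python retry wrapper with exponential backoff and jitter. `fetch_with_retries` and `get_ssm_parameter` are hypothetical names, and the timeout and delay values are arbitrary assumptions. The second function assumes boto3 is installed and that the task role is allowed to call `ssm:GetParameter`.

```python
import random
import time


def fetch_with_retries(fetch, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call `fetch()` until it succeeds or `attempts` calls have failed.

    Sleeps a random time between 0 and an exponentially growing cap
    (base_delay, 2*base_delay, ... up to max_delay) between attempts.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


def get_ssm_parameter(name, region="us-west-2"):
    """Hypothetical helper: read one decrypted parameter with boto3."""
    # Imported lazily so the retry helper above has no third-party dependency.
    import boto3
    from botocore.config import Config

    # Assumed values: short timeouts so a stuck connection fails fast.
    config = Config(connect_timeout=5, read_timeout=10)
    client = boto3.client("ssm", region_name=region, config=config)
    response = client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Example (the parameter name is a placeholder):
#   value = fetch_with_retries(lambda: get_ssm_parameter("/my/app/secret"))
```

In real use it is probably better to narrow `retry_on` to connection and timeout errors, so that a permissions error or a missing parameter fails immediately instead of being retried. Note that botocore also applies its own retries by default, so the total number of requests can be higher than `attempts`.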

Asked 2 years ago · 113 views
No answers
