Intermittent Timeouts when invoking Parameter Store or Secrets Manager from Lambda

I have an issue where my dev environment infrastructure fails to connect to our database: connections time out, and the services generally time out when I try to use them. Initially I thought Aurora Serverless was to blame, but that turned out to be a symptom rather than the cause.

The actual issue is that, often (7 times out of 10), the Lambdas time out when attempting to fetch parameters from SSM. I tried switching to Secrets Manager, but had exactly the same problem.

What I don't understand is that our production environment, running in the same account with almost the same configuration (both are created via Pulumi), does not have these issues. The only differences are the database scaling timeouts and the Lambda log retention, neither of which should matter once the database has started (which I have even forced, and the issue persists). Note that our Lambdas run inside a VPC. Production and dev have separate VPCs, but with exactly the same configuration.

I've enabled logging on the SSM Client and this is what I see:

{
    "level": "debug",
    "message": "endpoints Resolved endpoint: {\n  \"headers\": {},\n  \"properties\": {},\n  \"url\": \"https://ssm.eu-west-1.amazonaws.com/\"\n}",
    "timestamp": "2023-09-01 14:38:41:3841"
}

{
    "level": "error",
    "message": {
        "clientName": "SSMClient",
        "commandName": "GetParametersByPathCommand",
        "error": {
            "$metadata": {
                "attempts": 3,
                "totalRetryDelay": 212
            },
            "address": "67.220.224.4",
            "code": "ETIMEDOUT",
            "errno": -110,
            "name": "TimeoutError",
            "port": 443,
            "syscall": "connect"
        },
        "input": {
            "Path": "/dev",
            "WithDecryption": true
        },
        "metadata": {
            "attempts": 3,
            "totalRetryDelay": 212
        }
    },
    "timestamp": "2023-09-01 14:45:13:4513"
}
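
The `$metadata` in that error shows the SDK v3 default retry behaviour at work: three attempts, but only ~212 ms of total retry delay, because the backoff between attempts is short even though each `connect` hangs for the full socket timeout. Roughly (a hypothetical simulation of the jittered exponential backoff, with constants approximating the SDK defaults, not the SDK's actual source):

```typescript
// Rough simulation of AWS SDK v3 "standard" retry mode backoff:
// delay ≈ random() * min(100 * 2^retryCount, 20000) milliseconds.
// (Hypothetical sketch; constants approximate the SDK defaults.)
function retryDelays(attempts: number, rand: () => number): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < attempts - 1; retry++) {
    delays.push(Math.floor(rand() * Math.min(100 * 2 ** retry, 20_000)));
  }
  return delays;
}

// With 3 attempts there are 2 inter-attempt delays; a mid-range random
// draw lands in the same ballpark as the 212 ms in the log above.
console.log(retryDelays(3, () => 0.75)); // → [ 75, 150 ]
```

The `errno: -110` / `syscall: "connect"` pair also tells you the TCP connection never established at all, i.e. a network-path problem rather than SSM throttling.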

I'm at a loss as to what causes these timeouts. They're not consistent; sometimes everything just works. It only happens in our dev environment, which has the same Lambda, VPC, and SSM configuration as production. This is a complete pain because our CI/CD flow relies on these Lambdas starting up to run migrations and smoke tests, so a production deployment can be blocked for an hour while we re-run the dev deployment after every timeout until it succeeds.

Asked 8 months ago · 543 views
2 Answers
Accepted Answer

Found it.

One of the network interfaces for my Lambdas didn't have an IP address associated with it. I have no idea how that happened or when it changed, but that was it. I allocated a new Elastic IP and associated it with the second network interface linked to my dev environment Lambda.
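In case it helps anyone debugging the same thing: the check boils down to scanning the function's network interfaces for any with no address association. The field names below assume the shape of the EC2 `DescribeNetworkInterfaces` response, and the data is made up for illustration:

```typescript
// Flag ENIs with no public IP association. The `Association` field name
// follows the EC2 DescribeNetworkInterfaces response shape; the sample
// data is illustrative, not from a real account.
interface Eni {
  NetworkInterfaceId: string;
  Association?: { PublicIp: string };
}

function enisWithoutPublicIp(enis: Eni[]): string[] {
  return enis
    .filter((eni) => !eni.Association?.PublicIp)
    .map((eni) => eni.NetworkInterfaceId);
}

const sample: Eni[] = [
  { NetworkInterfaceId: "eni-aaa", Association: { PublicIp: "203.0.113.10" } },
  { NetworkInterfaceId: "eni-bbb" }, // the broken one: no IP associated
];
console.log(enisWithoutPublicIp(sample)); // → [ 'eni-bbb' ]
```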

Everything is working flawlessly now.

Answered 8 months ago

Given the above scenario, which is hard to debug/troubleshoot with the available info, have you considered using the Lambda extension for Parameter Store and Secrets Manager? https://aws.amazon.com/blogs/compute/using-the-aws-parameter-and-secrets-lambda-extension-to-cache-parameters-and-secrets/.

As the data is cached, this could reduce the need to make API requests.

Expert
Answered 8 months ago
  • Yes, this system was built 3 years ago, and the extension didn't exist at the time. Good idea though; we should probably consider moving to it, as it'll likely be cheaper and faster. Thanks for the input!
