Intermittent Timeouts when invoking Parameter Store or Secrets Manager from Lambda


I have an issue where my dev environment infrastructure fails to connect to our database: calls time out, and requests to other services generally time out as well. Initially I thought Aurora Serverless was to blame, but that turned out to be a symptom rather than the cause.

The issue seems to be that, roughly 7 times out of 10, the Lambdas time out when attempting to fetch parameters from SSM. I tried switching to Secrets Manager, but had exactly the same issue.

What I don't understand is that our production environment, running in the same account with almost the same configuration (both are created via Pulumi), does not have these issues. The only differences are the database scaling timeouts and the Lambda log retention, neither of which should affect this once the database has started (something I have even forced, and the issue persists). Note that our Lambdas run inside a VPC; production and dev have separate VPCs, but with exactly the same configuration.

I've enabled logging on the SSM Client and this is what I see:

{
    "level": "debug",
    "message": "endpoints Resolved endpoint: {\n  \"headers\": {},\n  \"properties\": {},\n  \"url\": \"https://ssm.eu-west-1.amazonaws.com/\"\n}",
    "timestamp": "2023-09-01 14:38:41:3841"
}

{
    "level": "error",
    "message": {
        "clientName": "SSMClient",
        "commandName": "GetParametersByPathCommand",
        "error": {
            "$metadata": {
                "attempts": 3,
                "totalRetryDelay": 212
            },
            "address": "67.220.224.4",
            "code": "ETIMEDOUT",
            "errno": -110,
            "name": "TimeoutError",
            "port": 443,
            "syscall": "connect"
        },
        "input": {
            "Path": "/dev",
            "WithDecryption": true
        },
        "metadata": {
            "attempts": 3,
            "totalRetryDelay": 212
        }
    },
    "timestamp": "2023-09-01 14:45:13:4513"
}

I'm at a loss as to what causes these timeouts. They're not consistent; sometimes everything just works. It only happens in our dev environment, which has the same Lambda, VPC, and SSM configuration as production. This is a complete pain, as our CI/CD flow relies on these Lambdas starting up to run migrations and perform smoke tests, so a production deployment can be blocked for an hour while we re-run the dev deployment after every timeout until it succeeds.
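In case it's relevant: the client is constructed with default timeouts. A minimal sketch of tightening them so failures surface faster (assuming the AWS SDK for JavaScript v3 with the `@smithy/node-http-handler` request handler; the values are illustrative, not our real config):

```typescript
import { SSMClient } from "@aws-sdk/client-ssm";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Illustrative values: fail the TCP connect quickly instead of waiting
// for the OS-level ETIMEDOUT, so a broken network path shows up fast
// in the CI logs rather than eating the whole Lambda timeout.
const ssm = new SSMClient({
  region: "eu-west-1",
  maxAttempts: 3,
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 2000, // ms allowed to establish the TCP connection
    requestTimeout: 5000,    // ms allowed for the response
  }),
});
```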

2 Answers

Accepted Answer

Found it.

One of the network interfaces for my Lambdas didn't have an IP address associated with it. I have no idea how that happened, or at what point it changed, but that was it. I allocated a new Elastic IP and associated it with the second network interface linked to my dev environment Lambda.

Everything is working flawlessly now.

Answered 8 months ago

Given the above scenario, which is hard to debug and troubleshoot with the information available, have you considered using the Lambda extension for Parameter Store and Secrets Manager? https://aws.amazon.com/blogs/compute/using-the-aws-parameter-and-secrets-lambda-extension-to-cache-parameters-and-secrets/

Since the data is cached, this could reduce the need to make API requests.
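A minimal sketch of what a lookup through the extension looks like from inside the handler (the parameter name is illustrative; port 2773 is the extension's documented default):

```typescript
// Build the extension's local endpoint URL for a given parameter name.
// The extension listens on http://localhost:2773 by default and serves
// cached values, avoiding a round trip to the SSM API on every invoke.
function extensionUrl(name: string): string {
  return (
    "http://localhost:2773/systemsmanager/parameters/get?name=" +
    encodeURIComponent(name) +
    "&withDecryption=true"
  );
}

// Inside the handler (Node.js 18+ runtime assumed), the call is
// authenticated with the session token the Lambda runtime provides:
// const res = await fetch(extensionUrl("/dev/db-password"), {
//   headers: { "X-Aws-Parameters-Secrets-Token": process.env.AWS_SESSION_TOKEN! },
// });
// const { Parameter } = await res.json();
```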

Expert
Answered 8 months ago
  • Yeah, this system was built 3 years ago; the extension didn't exist at the time. Good idea though, we should probably consider moving to it, as it'll likely be cheaper and faster. Thanks for the input!
