Intermittent Timeouts when invoking Parameter Store or Secrets Manager from Lambda


I have an issue where my dev environment infrastructure fails to connect to our database, with requests generally timing out when I try to use its services. Initially I thought it was Aurora Serverless, but that turned out to be a symptom rather than the cause.

The issue seems to be that, about 7 times out of 10, the Lambdas time out when attempting to fetch parameters from SSM. I tried switching to Secrets Manager, but had exactly the same issue.
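For reference, the call that times out is essentially the standard SDK v3 GetParametersByPath request below. This is a simplified sketch, not our exact code (pagination and result mapping omitted); the path and region are taken from the log output further down.

import { SSMClient, GetParametersByPathCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({ region: "eu-west-1" });

// Fetch all parameters under the environment prefix, decrypting SecureStrings.
export async function loadParameters(path: string) {
  const response = await ssm.send(
    new GetParametersByPathCommand({
      Path: path, // e.g. "/dev"
      WithDecryption: true,
    })
  );
  return response.Parameters ?? [];
}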

What I don't understand is that our production environment, running in the same account with almost the same configuration (both are created via Pulumi), does not have these issues. The only differences are the database scaling timeouts and the Lambda log retention, neither of which should affect this once the database has started (something I have even forced, and the issue persists). Note that our Lambdas run inside a VPC; production and dev have separate VPCs, but with exactly the same configuration.

I've enabled logging on the SSM Client and this is what I see:

{
    "level": "debug",
    "message": "endpoints Resolved endpoint: {\n  \"headers\": {},\n  \"properties\": {},\n  \"url\": \"https://ssm.eu-west-1.amazonaws.com/\"\n}",
    "timestamp": "2023-09-01 14:38:41:3841"
}

{
    "level": "error",
    "message": {
        "clientName": "SSMClient",
        "commandName": "GetParametersByPathCommand",
        "error": {
            "$metadata": {
                "attempts": 3,
                "totalRetryDelay": 212
            },
            "address": "67.220.224.4",
            "code": "ETIMEDOUT",
            "errno": -110,
            "name": "TimeoutError",
            "port": 443,
            "syscall": "connect"
        },
        "input": {
            "Path": "/dev",
            "WithDecryption": true
        },
        "metadata": {
            "attempts": 3,
            "totalRetryDelay": 212
        }
    },
    "timestamp": "2023-09-01 14:45:13:4513"
}
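For completeness, the client-side logging above was enabled via the SDK client's logger option; a minimal sketch, assuming a console-style logger:

import { SSMClient } from "@aws-sdk/client-ssm";

// Any object with debug/info/warn/error methods works as a logger.
const ssm = new SSMClient({
  region: "eu-west-1",
  logger: console,
});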

I'm at a loss as to what causes these timeouts. They're not consistent; sometimes everything just works. It only happens in our dev environment, which has the same Lambda, VPC, and SSM configuration as production. This is a real pain, because our CI/CD flow relies on these Lambdas starting up to run migrations and smoke tests, so a production deployment can be blocked for an hour while we re-run the dev deployment after every timeout until it succeeds.
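In the meantime, I'm considering tightening the client timeouts so a bad invocation fails within seconds instead of eating the whole Lambda timeout. An untested sketch, values illustrative:

import { SSMClient } from "@aws-sdk/client-ssm";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Fail fast: give up on the TCP connect after 1s and retry once,
// so a broken network path surfaces in seconds rather than minutes.
const ssm = new SSMClient({
  region: "eu-west-1",
  maxAttempts: 2,
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1000, // ms to establish the TCP connection
    requestTimeout: 3000,    // ms to wait for the response
  }),
});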

asked 8 months ago · 515 views
2 Answers
Accepted Answer

Found it.

One of the network interfaces for my Lambdas didn't have an IP address associated with it. I have no idea how that happened, or where along the way it changed, but that was it. I allocated a new Elastic IP and associated it with the second network interface linked to my dev environment Lambdas.

Everything is working flawlessly now.
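For anyone else hitting this, one way to spot it is to list the Lambda-managed ENIs and check their IP associations. A sketch using the SDK; the interface-type filter value "lambda" matches ENIs that Lambda creates for VPC-attached functions:

import { EC2Client, DescribeNetworkInterfacesCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "eu-west-1" });

// List the ENIs Lambda created in the VPC and flag any missing an address.
async function auditLambdaEnis() {
  const { NetworkInterfaces } = await ec2.send(
    new DescribeNetworkInterfacesCommand({
      Filters: [{ Name: "interface-type", Values: ["lambda"] }],
    })
  );
  for (const eni of NetworkInterfaces ?? []) {
    console.log(
      eni.NetworkInterfaceId,
      eni.PrivateIpAddress ?? "NO PRIVATE IP",
      eni.Association?.PublicIp ?? "no public IP"
    );
  }
}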

answered 8 months ago

Given the above scenario, which is hard to debug/troubleshoot with this information, have you considered using the Lambda extension for Parameter Store and Secrets Manager? https://aws.amazon.com/blogs/compute/using-the-aws-parameter-and-secrets-lambda-extension-to-cache-parameters-and-secrets/

As the data is cached, this could reduce the need to make API requests.
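For reference, with the extension layer attached, the function reads parameters from a local cached endpoint instead of calling the SSM API directly. A minimal sketch; port 2773 is the extension's default, and fetch is built into the Node 18+ runtimes:

// The extension listens on localhost:2773 inside the Lambda sandbox and
// caches responses; the session token header is required for auth.
export async function getParameter(name: string) {
  const url =
    "http://localhost:2773/systemsmanager/parameters/get?name=" +
    encodeURIComponent(name) +
    "&withDecryption=true";
  const res = await fetch(url, {
    headers: { "X-Aws-Parameters-Secrets-Token": process.env.AWS_SESSION_TOKEN! },
  });
  if (!res.ok) throw new Error(`Extension returned ${res.status}`);
  const body = await res.json();
  return body.Parameter?.Value;
}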

EXPERT
answered 8 months ago
  • Yes, this system was built 3 years ago; that didn't exist at the time. Good idea, though. We should probably consider moving to it, as it'll likely be cheaper and faster. Thanks for the input!
