I have an issue where my dev environment fails to connect to our database, and requests generally time out whenever I try to use the services. Initially I thought Aurora Serverless was to blame, but that turned out to be a symptom rather than the cause.
The underlying issue seems to be that, roughly 7 times out of 10, the Lambdas time out when fetching parameters from SSM. I tried switching to Secrets Manager, but hit exactly the same issue.
What I don't understand is that our production environment, running in the same account with almost the same configuration (both are created via Pulumi), does not have these issues. The only differences are the database scaling timeouts and the Lambda log retention, neither of which should matter once the database has started (which I have even forced, and the issue persists). Note that our Lambdas run inside a VPC; Production and Dev have separate VPCs with exactly the same configuration.
I've enabled logging on the SSM client, and this is what I see:
{
  "level": "debug",
  "message": "endpoints Resolved endpoint: {\n \"headers\": {},\n \"properties\": {},\n \"url\": \"https://ssm.eu-west-1.amazonaws.com/\"\n}",
  "timestamp": "2023-09-01 14:38:41:3841"
}
{
  "level": "error",
  "message": {
    "clientName": "SSMClient",
    "commandName": "GetParametersByPathCommand",
    "error": {
      "$metadata": {
        "attempts": 3,
        "totalRetryDelay": 212
      },
      "address": "67.220.224.4",
      "code": "ETIMEDOUT",
      "errno": -110,
      "name": "TimeoutError",
      "port": 443,
      "syscall": "connect"
    },
    "input": {
      "Path": "/dev",
      "WithDecryption": true
    },
    "metadata": {
      "attempts": 3,
      "totalRetryDelay": 212
    }
  },
  "timestamp": "2023-09-01 14:45:13:4513"
}
I'm at a loss as to what causes these timeouts. They're not consistent; sometimes everything just works. It only happens in our Dev environment, which has the same Lambda, VPC, and SSM configuration as Production. This is a real pain, because our CI/CD flow relies on these Lambdas starting up to run migrations and smoke tests, so a Production deployment can be blocked for an hour while we re-run the Dev deployment after every timeout until it succeeds.
Yes, this system was built 3 years ago; that didn't exist at the time. Good idea, though. We should probably consider moving to it, as it'll likely be cheaper and faster. Thanks for the input!