Lambda Stability issue

Hello,

We're having trouble with one of our Lambda functions. Sometimes it works when we invoke it, but other times we get a timeout error. We noticed that it starts working again temporarily if we make changes to the Lambda function, but then it stops working after a while.

Here's what we've tried so far:

  1. Checked all the security groups to ensure they match the ones used in the staging Lambda environment.
  2. Assigned public IP addresses to all of the Lambda function's network interfaces, thinking that internet traffic might be causing the issue. This method resolved the problem in the staging environment.
  3. Added a new security group to the Lambda function to allow all traffic, but the issue persisted.

It's worth noting that the function works fine in the staging environment on AWS.

We've noticed a pattern:

  1. The function tends to work after we make changes to it. For example, when we add or remove a security group or change the runtime from Node.js 20.x to 18.x, it starts working again temporarily. However, even if it works when we test it manually after these changes, it eventually stops working again until we make more changes.
  2. We assume it may happen when the function is called multiple times from the frontend, but we have not gone live yet and there are hardly 4 to 5 users on it.

This seems to be the pattern we've observed.

Rakesh
asked a month ago · 92 views
4 Answers

This sounds like a code error. When you invoke a function for the first time, we initialize an execution environment. When you invoke it again, we reuse the previous execution environment. When you make changes, we create a new one.

It is possible that you have a memory leak, or some other coding mistake, that causes the code to get into a loop when it is invoked several times. I recommend that you add some debug messages and try again to find out where your code is getting stuck.
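To illustrate the reuse described above, here is a minimal sketch of a handler that logs whether it is running in a fresh or a reused execution environment. It is written in Python for brevity (the function in the question runs on Node.js, where the same pattern applies), and the handler name and log fields are assumptions for illustration, not the asker's code.

```python
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Module-level state is created once per execution environment (cold start)
# and survives across warm invocations of that same environment.
_ENV_STARTED_AT = time.time()
_invocation_count = 0


def handler(event, context):
    """Hypothetical handler that reports whether the environment was reused."""
    global _invocation_count
    _invocation_count += 1
    cold_start = _invocation_count == 1
    logger.info(
        "invocation=%d cold_start=%s env_age_s=%.1f",
        _invocation_count,
        cold_start,
        time.time() - _ENV_STARTED_AT,
    )
    # The real work would go here, with a log line before and after each
    # external call so the last line in the logs shows where it got stuck.
    return {"invocation": _invocation_count, "cold_start": cold_start}
```

If timeouts only ever appear on invocations logged with cold_start=False, state carried over between invocations (open connections, growing caches) is a likely suspect.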

Uri (AWS, EXPERT)
answered a month ago
EXPERT
reviewed a month ago
  • Hey Uri, thanks for your response.

    If it were a code error, then it shouldn't work in another Lambda function in another region (us-west-2).

How are you triggering the Lambda? Have you investigated the CloudWatch logs? What are the metrics showing?

It sounds like you are timing out because the function is taking too long to run. There could be many reasons for this; for example, a resource you are referencing may itself be timing out, or your function may be doing too much and not have time to finish.
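One way to tell a slow downstream resource apart from slow function code is to give every outbound call a timeout shorter than the time left in the invocation, so the function fails fast with a clear error instead of hitting the Lambda timeout. The sketch below is in Python and makes assumptions: the helper names, the 2-second safety margin and the 10-second cap are hypothetical; context.get_remaining_time_in_millis() is the method provided by the Lambda Python context object.

```python
import socket
import time


def pick_timeout(remaining_ms, safety_margin_s=2.0, max_timeout_s=10.0):
    """Return a connect timeout that leaves a margin before the Lambda timeout."""
    remaining_s = remaining_ms / 1000.0
    return max(0.1, min(max_timeout_s, remaining_s - safety_margin_s))


def connect_with_budget(host, port, context):
    """Try a TCP connection and report how long it took (hypothetical helper)."""
    timeout_s = pick_timeout(context.get_remaining_time_in_millis())
    started = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return {"ok": True, "elapsed_s": time.monotonic() - started}
    except OSError as exc:
        return {
            "ok": False,
            "elapsed_s": time.monotonic() - started,
            "error": repr(exc),
        }
```

Logging the returned dictionary for each dependency should show whether the time goes into reaching the external service or into the function's own processing.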

You can't assign a public IP address to a Lambda function. With Lambda, security groups matter for outbound access only; there's no inbound traffic to a Lambda function. What do your subnets and route tables look like? Is your ENI in a public or private subnet?
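As a rough aid for the subnet question: since a VPC-attached Lambda function does not use a public IP, it generally needs a subnet whose default route points at a NAT gateway to reach the internet. The Python sketch below classifies a route table by its default route. The function name and return labels are hypothetical; the input is assumed to be a dictionary shaped like one entry of the EC2 DescribeRouteTables response.

```python
def classify_default_route(route_table):
    """Classify a route table by where its 0.0.0.0/0 route points."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("NatGatewayId"):
            # Private subnet with outbound internet access through NAT.
            return "private-with-nat"
        if (route.get("GatewayId") or "").startswith("igw-"):
            # Public subnet: only resources with a public IP can use this route.
            return "public"
        return "other"
    # No default route: no path to the internet from this subnet.
    return "no-default-route"
```

With boto3, the input could come from describe_route_tables filtered on association.subnet-id; keep in mind that a subnet without an explicit association uses the VPC's main route table. A result of "public" or "no-default-route" for the function's subnets would be consistent with outbound calls timing out.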

EXPERT
answered a month ago
  • Hey Gary Mclean, thanks for your response.

    We are not triggering the Lambda from code. The server is hosted in the same region and is connected through the AWS CLI.

    We have assigned public IPs to the Lambda service security group because the function's task is to connect to Snowflake and fetch data (outbound).

I don't know if your issue is the same as the one I have in my environment. I was thinking about opening a new post, and then I saw yours, which looks like the same issue I'm dealing with now.

I currently have an environment with a Lambda that is based on an ECR image. It's not the only one; I have other environments with this setup.

That said: My environment consists of:

  1. API Gateway with Lambda integration
  2. EFS
  3. VPC (it's working properly)
  4. ECR Image

What is my issue: When I made the first requests while the Lambda was warmed up, it worked very well. But after a while, we started to see instability during requests, with task timeouts.

What I tested: I changed the Lambda timeout to check whether it was taking longer than necessary to execute (which does not make sense in my situation, because the process reaches the end), but it still doesn't finish and answer back with the generated response. I thought it could be the connection between Lambda and API Gateway, which might not be starting the ENI properly. So I then tested the function call from the console with a configured test call. I continued receiving task timeouts, the same issue. But it only happens in two situations:

  1. If the Lambda function is idle.
  2. Sometimes, after a long idle time, when I try to make requests, the Lambda function takes something like 1 hour to stop showing the task timeout issue.

Now I'm trying to keep the instance warm, and I think it could fix my issue.

I tried debugging my code, looking at the EFS throughput, and increasing my instance size. What I noticed is that if the instance is already available, it sometimes answers in milliseconds, depending on the request. The problem continued even when I used provisioned instances as well.

So I don't know if it's the same issue you are facing, but I have reached 2 conclusions; correct me if I'm wrong. During the day, if your instance is idle, AWS reclaims the resources (Lambda is serverless, so it needs to redistribute what's not being used, am I right?), and that's when I get 1 hour of unavailability. So if I need more availability in my situation, I need to keep my Lambda warm or use another AWS service, like ECS.
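For anyone who wants to try the keep-warm idea, a minimal Python sketch could look like the following. It assumes a scheduled rule (for example in EventBridge) invokes the function every few minutes with a payload containing a "warmup" key; that key, the handler name and do_work are hypothetical placeholders, and this is only an illustration of the approach, not a confirmed fix for the timeouts.

```python
def do_work(event):
    """Placeholder for the real request handling (hypothetical)."""
    return {"echo": event}


def handler(event, context):
    """Return early on warm-up pings, otherwise process the request."""
    # Scheduled warm-up events are assumed to carry {"warmup": true}.
    if isinstance(event, dict) and event.get("warmup") is True:
        return {"warmed": True}
    return {"warmed": False, "result": do_work(event)}
```

The early return keeps the warm-up invocations cheap, because they skip the EFS and downstream calls that real requests make.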

Did this answer your question, or help you figure out something new?

What do the AWS experts think about this?

Warm regards

Ernani
answered a month ago
EXPERT
reviewed a month ago

Please refer to my post, where I fixed a similar issue:

https://repost.aws/questions/QU75AxAy1eSD29YF2rUTfl2g/lambda-instability-sometimes

Ernani
answered 19 days ago
