Python Lambda Function Suddenly Timing Out


Hi team,

We have been using a Lambda function for several months without any issues. Two days ago I suddenly started getting "Task timed out" errors in my CloudWatch logs. At first there were only a few during the first hours, but as time goes on they increase to the point where the function does not work anymore. None of the settings for our Lambda function have changed, and neither has the data coming in from the IoT sensors that invoke it via a rule.

There are about 150 sensors streaming every 10-15 seconds, and as each value comes in, the AWS IoT Core rule invokes the Lambda function, which reads the value and processes it. The timeout configuration is set to 60 seconds, and for some reason the function is reaching that value. The Lambda function does a few things: it uses the AWS SDK to write to AWS SiteWise and it also writes to AWS Timestream.

I have tried adding print statements after each line in my code so I can see where the issue is happening, but when a task times out, none of the print output shows up in the CloudWatch logs. I have looked to see if there were any updates to Lambda or Timestream, but I did not find anything. I have also seen suggestions to increase the memory, but looking at the logs, the function does not even get close to max memory.

The one thing I find very strange is that in the CloudWatch logs I can see the three events START, END, and REPORT. The timestamp of the START is only about 2 seconds before the END and REPORT events, yet the REPORT says "Task timed out after 62.01 seconds". If the start and end of the invocation are about 2 seconds apart, then how did it reach the max timeout of the function, which is set to 1 minute? I have also seen some articles about enabling X-Ray tracing on the function, but so far I have not gotten any valuable information from it.

I have posted the CloudWatch logs for one of the many "Task timed out" errors below. Please let me know if I have missed anything, or anything I can try to fix the issue or investigate the cause better. Any help would be appreciated.
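
For reference, here is a stripped-down sketch of roughly what the handler does. This is only a minimal illustration, not our actual code; the SiteWise property alias, the Timestream database/table names, and the payload field names below are placeholders.

    import time
    import boto3

    # Clients are created once per container so they are reused across invocations
    sitewise = boto3.client("iotsitewise")
    timestream = boto3.client("timestream-write")

    def lambda_handler(event, context):
        # Placeholder field names; the real payload comes from the IoT Core rule
        sensor_id = event["sensor_id"]
        value = float(event["value"])
        ts = int(time.time())

        # Write the value to AWS IoT SiteWise (the property alias is a placeholder)
        sitewise.batch_put_asset_property_value(
            entries=[{
                "entryId": f"{sensor_id}-{ts}",
                "propertyAlias": f"/sensors/{sensor_id}/value",
                "propertyValues": [{
                    "value": {"doubleValue": value},
                    "timestamp": {"timeInSeconds": ts},
                    "quality": "GOOD",
                }],
            }]
        )

        # Write the same value to Timestream (database/table names are placeholders)
        timestream.write_records(
            DatabaseName="sensor_db",
            TableName="sensor_table",
            Records=[{
                "Dimensions": [{"Name": "sensor_id", "Value": sensor_id}],
                "MeasureName": "value",
                "MeasureValue": str(value),
                "MeasureValueType": "DOUBLE",
                "Time": str(ts * 1000),  # milliseconds since epoch
            }],
        )

        return {"status": "ok"}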

Thank you for your time, Shervin

[Screenshot: CloudWatch logs for one of the timed-out invocations]

2 Answers

It is unusual that this has started to happen suddenly; my guess is that there is something in the code which is now taking a lot longer than it used to - which is why you're hitting the runtime timeout. A further guess: there's an external process/host being called which isn't responding, and the timeouts from that are pushing the execution time past the limit that is set (60 seconds).

In this case the best way to troubleshoot is to put instrumentation in your code - even something as simple as print statements that indicate where the code is at any particular time. You might also use X-Ray - but it's completely up to you which tools to use.

That's going to help you identify what the code is doing each step of the way and which of those steps is taking the most time.
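
As a rough illustration of that kind of instrumentation (the step names here are placeholders, not your actual code), something like the following makes it easy to see how far the handler gets and how long each step takes:

    import time

    def lambda_handler(event, context):
        t0 = time.time()
        print(f"handler start, remaining ms: {context.get_remaining_time_in_millis()}")

        # ... step 1: parse the incoming sensor value ...
        print(f"payload parsed at +{time.time() - t0:.2f}s")

        # ... step 2: write to SiteWise ...
        print(f"SiteWise write done at +{time.time() - t0:.2f}s")

        # ... step 3: write to Timestream ...
        print(f"Timestream write done at +{time.time() - t0:.2f}s")

Whichever line is the last one you see before the timeout points at the step to examine.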

AWS
EXPERT
answered a year ago
  • Thank you for your response,

    As mentioned above, when the Lambda throws the timeout logs in CloudWatch, it does not show any of the print statements. I have set a print statement on the first line of the handler and after each step, but none of them show when there is a timeout. After spending some more time, what I did was comment out each part of the code and let it run, to achieve the same result of knowing which part is responsible. I got to the point where I commented out all of the code and let it run for 12 hours or so. It seems like the timeout is still happening, but a lot less, to the point that it does not break the Lambda.

    One thing to note is that we are mounting an EC2 instance to the function, where we load up some Python models and execute them on the fly every time a message is received from a sensor. I removed the part where we actually load up the model and execute it, and I still had some timeout issues. Not as many, but it's happening. Now I have removed the part where we mount the EC2 instance (sys.path.append('/mnt/access')) to see whether we still have issues without it. But the thing that does not make sense is that nothing has changed with the models, how we load them up, or the amount of data coming in. Why would this suddenly cause issues?

    Also, as mentioned in my first post and the screenshots I provided, I can't understand why it says it timed out after 62 seconds when there is only a 2-second difference between the CloudWatch START event and the END and REPORT events.

  • If you're not seeing any log messages then the timeout is happening outside the function body (i.e. in code that is not inside lambda_handler), so you'll need to check there too (see the sketch after these comments). As for the time difference: perhaps the CloudWatch Logs entry is not created at function launch but later?

  • I have removed the section where we mount the EC2 instance. I have also put print statements in the external Timestream script we use to run queries. I still get timeout issues with no logs. The only thing I can think of is that maybe the imports I have are somehow causing issues. I also have a feeling the function doesn't even run. Even though the function concurrency is set to 1000, it seems that when the signals come in at the same time, it causes the issue. The reason I say this is that when we turn our AWS gateway off for some time and then turn it back on, it throws the timeout errors right away. So it seems that when the gateway comes on, it sends all the messages at once, which crashes the function, and it never recovers. It always gets stuck in the timeout and none of the messages get through, with no logs.

    Also, I noticed there is a warning that shows up before things go wrong: "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k".

    Not sure if this could be the issue. I did not see this warning last month when everything was running OK.

  • I think the only way to troubleshoot this is to start removing (or commenting out) components/imports until you can determine what is the root cause. Unfortunately, sometimes brute-force is what is required.
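
Following up on the comment above about the timeout possibly happening outside the handler body: here is a minimal sketch of module-level logging that would show whether the init phase (imports, the mounted path, client creation) is where things hang. The /mnt/access path mirrors what was described in the comments; everything else is a placeholder.

    import time

    _t0 = time.time()
    print(f"init: module load started, t0={_t0}")

    import sys
    sys.path.append('/mnt/access')  # mounted path described in the comments
    print(f"init: mount path appended at +{time.time() - _t0:.2f}s")

    import boto3  # heavy imports (model libraries, etc.) would go here too
    print(f"init: imports finished at +{time.time() - _t0:.2f}s")

    sitewise = boto3.client("iotsitewise")
    timestream = boto3.client("timestream-write")
    print(f"init: clients created at +{time.time() - _t0:.2f}s")

    def lambda_handler(event, context):
        print(f"handler entered at +{time.time() - _t0:.2f}s")
        # ... rest of the handler ...

If the init lines show up but the handler line never does, the hang is in initialization rather than in the handler code.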


Only a guess here, but based on the name of the Lambda function, does it mount an EFS volume? Maybe there is some issue there? If you are sure that you did not change anything from your side, I would recommend creating a support case.
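
If it does mount EFS, one quick way to check whether the mount itself is the slow part would be to time a simple file operation against the mount path at the top of the handler. The /mnt/access path below is taken from the comments on this question; everything else is a placeholder:

    import os
    import time

    def lambda_handler(event, context):
        start = time.time()
        try:
            entries = os.listdir("/mnt/access")  # touch the mounted file system
            print(f"mount listdir returned {len(entries)} entries in {time.time() - start:.2f}s")
        except OSError as err:
            print(f"mount access failed after {time.time() - start:.2f}s: {err}")

        # ... rest of the handler ...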

Are you using any Lambda extensions?

AWS
EXPERT
Uri
answered a year ago
  • Thank you for your response,

    Yes, we are mounting an EC2 instance (sys.path.append('/mnt/access')) to the Lambda function so we can load and execute Python models on the fly in under a few seconds. But the thing is, we have had no issues doing this for 3 months. Pretty much the same amount of data is coming in and everything is pretty much the same, but now it's running into this issue. I have just commented out the part where we mount the EC2 instance, along with the rest of the code. I will let it run for 12 hours or so and see if we get errors now. So now all the code does is take the incoming values from the sensors and push them to AWS SiteWise using the boto3 library.

    The only thing we are using on Lambda is boto3 to communicate with AWS SiteWise and Timestream. Not sure if that's what you're referring to.

    Also, as mentioned in my first post and the screenshots I provided, I can't understand why it says it timed out after 62 seconds when there is only a 2-second difference between the CloudWatch START event and the END and REPORT events. Shouldn't the START log and the following logged events be at least 60 seconds apart?
