It is unusual for this to start happening suddenly; my guess is that something in the code is now taking a lot longer than it used to, which is why you're hitting the runtime timeout. A further guess: there's an external process/host being called that isn't responding, and the timeouts from that are pushing the execution time past the limit that is set (60 seconds).
In this case the best way to troubleshoot is to put instrumentation in your code - even something as simple as print statements that indicate where the code is at any particular point in time. You might also use X-Ray, but which tools to use is entirely up to you.
That will help you identify what the code is doing at each step of the way and which of those steps is taking the most time.
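As a minimal sketch of that kind of instrumentation (the handler structure and step comments here are placeholders, not your actual code):

```python
import time

def lambda_handler(event, context):
    # Record a start time so each print shows the elapsed time so far.
    t0 = time.monotonic()
    print("handler start")

    # ... step 1: e.g. read sensor values from the incoming event ...
    print(f"after step 1: {time.monotonic() - t0:.2f}s elapsed")

    # ... step 2: e.g. push values to SiteWise / Timestream via boto3 ...
    print(f"after step 2: {time.monotonic() - t0:.2f}s elapsed")

    return {"status": "ok"}
```

When a timeout does occur, the last elapsed-time line that made it to CloudWatch tells you which step the function was stuck in.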
Only a guess here, but based on the name of the Lambda function, does it mount an EFS volume? Maybe there is some issue there? If you are sure that you did not change anything from your side, I would recommend creating a support case.
Are you using any Lambda extensions?
Thank you for your response,
Yes, we are mounting an EC2 instance (sys.path.append('/mnt/access')) to the Lambda function so we can load and execute Python models on the fly in under a few seconds. But the thing is, we have had no issues doing this for 3 months. Pretty much the same amount of data is coming in and everything is pretty much the same, but now it's running into this issue. I have just commented out the part where we mount the EC2 instance and the rest of that code. I will let it run for 12 hours or so and see if we get errors now. So now all the code does is take the incoming values from sensors and push them to AWS SiteWise using the boto3 library.
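For context, the load-models-from-a-mounted-path pattern described above can be sketched roughly like this (the `load_model` helper and the `fake_model` module name are hypothetical; only the `/mnt/access` path comes from the description):

```python
import importlib
import sys

def load_model(module_name, mount_path="/mnt/access"):
    """Import a Python model module from a mounted file system path.

    Assumes the mount contains plain .py files; a hang here (e.g. a slow
    or unresponsive mount) would stall the whole invocation.
    """
    if mount_path not in sys.path:
        sys.path.append(mount_path)
    return importlib.import_module(module_name)
```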
The only things we are using on Lambda are boto3 calls to AWS SiteWise and Timestream. Not sure if that's what you're referring to.
Also, as mentioned in my first post and the screenshots I provided, I can't understand why it says it timed out after 62 seconds but there is only a 2-second difference between the CloudWatch START event and the END and REPORT events. Shouldn't the START log and the following logged events be at least 60 seconds apart?
Thank you for your response,
As mentioned above, when the Lambda throws the timeout logs in CloudWatch it does not show any of the print statements. I have set a print statement on the first line of the handler and after each step, but none of them show when there is a timeout. After spending some more time, what I did was comment out each part of the code and let it run, to get the same result of knowing which part is responsible. I got to the point where I commented out all of the code and let it run for 12 hours or so. It seems like the timeout is still happening, but a lot less, to the point that it does not break the Lambda.

One thing to note is that we are mounting an EC2 instance to the function, where we load up some Python models and execute them on the fly every time a message is received from a sensor. I removed the part where we actually load up the models and execute them, and I still had some timeout issues. Not as many, but it's happening. Now I have removed the part where we mount the EC2 instance (sys.path.append('/mnt/access')) to see if we have issues without that part. But the thing that does not make sense is that nothing has changed with the models, how we load them, or the amount of data coming in. Why would this suddenly cause issues?
Also, as mentioned in my first post and the screenshots I provided, I can't understand why it says it timed out after 62 seconds but there is only a 2-second difference between the CloudWatch START event and the END and REPORT events.
If you're not seeing any log messages then the timeout is happening outside the function body (i.e. in the code that is not inside lambda_handler), so you'll need to check there too. As far as the time difference goes: perhaps the CloudWatch Logs entry is not being created at function launch, but later instead?

I have removed the section where we mount the EC2 instance. I have also put print statements in the external Timestream script we use to run queries. I still get timeout issues with no logs. The only thing I can think of is that maybe the imports I have are somehow causing issues. I also have a feeling the function doesn't even run. Even though the function concurrency is set to 1000, it seems that when the signals come in at the same time, it causes the issue. The reason I say this is that when we turn our AWS gateway off for some time and then turn it back on, it throws the timeout errors right away. So it seems that when the gateway comes on it sends all the messages at once, which crashes the function and it never recovers. It always gets stuck in the timeout and none of the messages get through, without any logs.
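To make the "outside the function body" distinction concrete, here is an illustrative sketch (none of this is your actual code) of which parts of a Python Lambda run during the init phase versus per invocation:

```python
# Everything at module level runs during the Lambda *init* phase,
# before the per-invocation START log line. If an import or a
# module-level call hangs here (for example, something reading from a
# slow or unresponsive mount), the function can time out without any
# of the handler's print statements ever appearing in the logs.
print("init phase: runs once per execution environment")
# slow_setup()  # hypothetical: a hang here = timeout with no handler logs

def lambda_handler(event, context):
    # Only code from here on runs per invocation, after START is logged.
    print("handler invoked")
    return {"status": "ok"}
```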
Also, I noticed there is a warning that shows up before things go wrong: "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k".
Not sure if this could be the issue. I didn't see this warning last month when everything was running OK.
I think the only way to troubleshoot this is to start removing (or commenting out) components/imports until you can determine what is the root cause. Unfortunately, sometimes brute-force is what is required.