Endpoint Discovery takes too long for default Lambda timeout


I'm using various SDK v3 clients in my Lambda (IoT and Timestream, for example). Sometimes, when the Lambda is updated or otherwise goes through a cold start, endpoint discovery seems to take too long and the Lambda (configured with the default 3-second timeout) times out. I've confirmed it's a timeout via the CloudWatch logs, which show the runtime as ~3000 ms followed by a service failure; for API Gateway requests, the requesting client gets an HTTP 500 response. I can't be sure it's the endpoint discovery, but I can see that this could be a lengthy, periodic operation, especially after a cold start and whenever the cached endpoint expires. I've seen occasional similar failures when I haven't been hitting the Lambdas for a while, and a retry always seems to work.

I've seen other posts suggesting that not specifying the region in the client constructor options might be an issue, but nothing definitive. I assume the client would pick up the region from the environment, just as I would in order to pass it as an option when creating the client. I'm only targeting a single region at the moment, so I have no need to specify different regions; the default for the Lambda should be sufficient. It's not clear from the docs whether the client uses the environment if region or any other parameter is not set, so it's possible that it does some multi-region lookup and that's slowing it down. Most of the time it's fine, but sometimes it glitches.

I don't particularly want to set an arbitrarily longer timeout for the Lambda, as this could be problematic in other ways. I would expect AWS to prioritise/optimise endpoint discovery internally so that it falls well within the default, but there are occasions when this seems to fail. A subsequent call to the Lambda works fine, and I've yet to see consecutive requests fail. I can't find docs suggesting a minimum expected discovery time that I should be sizing the Lambda timeout for.

Ideas?

1 Answer

Hi,

If you don't explicitly specify the region in the SDK client, it will use the AWS_REGION environment variable that is automatically set by AWS Lambda.
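If you would rather make the region explicit anyway, here is a minimal sketch of building the client options from the environment once, at module scope, so the same client is reused across warm invocations. The `buildClientConfig` helper and the `ClientConfig` interface are hypothetical names made up for this example; `region` and `maxAttempts` are existing SDK v3 client options, and the value 3 is only an illustrative choice.

```typescript
// Subset of the options accepted by SDK v3 client constructors.
interface ClientConfig {
  region: string;
  maxAttempts: number;
}

// Hypothetical helper: build the config once from the environment that the
// Lambda runtime provides. Throws early if no region is available, instead
// of failing later on the first request.
function buildClientConfig(
  env: Record<string, string | undefined>,
  maxAttempts = 3, // illustrative value, tune for your workload
): ClientConfig {
  const region = env.AWS_REGION;
  if (!region) {
    throw new Error("AWS_REGION is not set");
  }
  return { region, maxAttempts };
}

// In a Lambda module this would run once, outside the handler, for example:
//   const timestream = new TimestreamWriteClient(buildClientConfig(process.env));
// Here a literal environment is passed so the sketch runs anywhere.
const config = buildClientConfig({ AWS_REGION: "eu-west-1" });
console.log(config);
```

Passing the region explicitly should not change behaviour compared to the default lookup; it mainly makes a missing or unexpected region fail loudly at start-up.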

Have you tried enabling tracing via X-Ray and instrumenting the SDK clients? That should give you much more detailed information on where the time is spent. Please refer to this document for more information.
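As a lightweight complement to X-Ray (not a replacement for it), you could also log the duration of each client call yourself. This is only a sketch: `timed` is a hypothetical helper, not part of the SDK, and the commented usage assumes a client and command you have already created.

```typescript
// Hypothetical helper: run an async call and log how long it took,
// whether it succeeded or failed.
async function timed<T>(
  label: string,
  call: () => Promise<T>,
  log: (message: string) => void = console.log,
): Promise<T> {
  const start = Date.now();
  try {
    return await call();
  } finally {
    log(`${label} took ${Date.now() - start} ms`);
  }
}

// Inside a handler this might wrap an SDK call, for example:
//   const result = await timed("WriteRecords", () => timestream.send(command));
// A stand-in promise is used here so the sketch runs on its own.
timed("demo call", async () => "done").then((value) => console.log(value));
```

The log lines end up in CloudWatch next to the invocation report, which makes it easier to see whether the slow calls cluster around cold starts.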

AWS
EXPERT
answered 9 months ago
  • Thanks, I have tried X-Ray, and I find that I get a wide range of response times, well up into several seconds. Most are tens of ms, but I see plenty, not just around endpoint discovery, that are multi-second for things like TimestreamWrite and DynamoDB Get and Update (soon after a Lambda restart and discounting endpoint discovery).

    I've increased my lambda timeout to 15s and now get few failures, but that doesn't resolve or explain why the individual clients would be so slow. If this is considered 'normal', then I probably need to consider a different strategy as this is just data ingestion.
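Since a retry always seems to work, one possible strategy is to give each call a short deadline and retry it within the same invocation, rather than raising the Lambda timeout. The following is a rough sketch only: `withDeadline` and `withRetry` are hypothetical helpers, the attempt count and deadline are arbitrary, the timed-out call is abandoned rather than cancelled, and it assumes the wrapped operation is safe to repeat.

```typescript
// Hypothetical helper: reject if the call has not settled within `ms`.
// The underlying call is not cancelled; its result is simply ignored.
function withDeadline<T>(call: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`deadline of ${ms} ms exceeded`)),
      ms,
    );
    call().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical helper: try the call up to `attempts` times, each with its
// own deadline, and rethrow the last error if every attempt fails.
async function withRetry<T>(
  call: () => Promise<T>,
  attempts: number,
  ms: number,
): Promise<T> {
  let lastError: unknown = new Error("no attempts were made");
  for (let i = 0; i < attempts; i++) {
    try {
      return await withDeadline(call, ms);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Inside a handler this might look like:
//   await withRetry(() => dynamo.send(command), 2, 1000);
// A stand-in promise is used here so the sketch runs on its own.
withRetry(async () => "ok", 2, 1000).then((value) => console.log(value));
```

The total of attempts times deadline still has to fit inside the Lambda timeout, and for writes a repeated call may produce duplicates unless the operation is idempotent.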
