Python lambda failing to initialize RSA public key occasionally

0

I'm trying to create a custom request authorizer working with several user pools, in Python.

So to validate tokens I tried first with pyjwk/cryptography

 claims = jwt.decode(token, options={"verify_signature": False, "require":["iss"]})
 issuer = claims['iss']
 jwks_client = PyJWKClient(issuer+"/.well-known/jwks.json",False)
 signing_key = jwks_client.get_signing_key_from_jwt(token)

Occasionally, about 5% of the time, lambda instance will just timeout on this last line, even with 30 second run time.

Thought maybe it was network, rewrote it to get the JWK through requests and initialize the key with RSAAlgorithm.from_jwk, nope - the JWK is retrieved, but it's initializing the key that fails.

Called RSAAlgorithm.from_jwk outside the handle method with dummy hardcoded JWK to move initialization of cryptography to init stage; handler method works smoother now, instead of being slow on the first invocation, but the random failure still happens.

Thought maybe it was cryptography or pyjwk, switched to python-jose and it's different backends. Nope - still fails in loading the key, now written as jwk.construct().

What is causing this strange and random behavior?

An instance that failed once stays permanently broken and doesn't recover in the next request. On the logs there's nothing, although such broken instances drop the memory usage. Here are first two requests from broken and working instances running the same image, same time, for same user pool key.

Broken:

2022-02-14T17:38:17.185+02:00 START RequestId: d9d92287-ccae-4aa4-8f94-6e8ed8c276a4 Version: $LATEST

2022-02-14T17:38:17.205+02:00 [DEBUG] 2022-02-14T15:38:17.205Z d9d92287-ccae-4aa4-8f94-6e8ed8c276a4 Starting new HTTPS connection (1): cognito-idp.us-east-1.amazonaws.com:443

2022-02-14T17:38:20.190+02:00 END RequestId: d9d92287-ccae-4aa4-8f94-6e8ed8c276a4

2022-02-14T17:38:20.190+02:00 REPORT RequestId: d9d92287-ccae-4aa4-8f94-6e8ed8c276a4 Duration: 3003.51 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 53 MB Init Duration: 467.93 ms

2022-02-14T17:38:20.190+02:00 2022-02-14T15:38:20.189Z d9d92287-ccae-4aa4-8f94-6e8ed8c276a4 Task timed out after 3.00 seconds

2022-02-14T17:38:20.706+02:00 START RequestId: a5242265-c13d-4015-9b7d-2699f0b26efe Version: $LATEST

2022-02-14T17:38:20.709+02:00 [DEBUG] 2022-02-14T15:38:20.709Z a5242265-c13d-4015-9b7d-2699f0b26efe Starting new HTTPS connection (1): cognito-idp.us-east-1.amazonaws.com:443

2022-02-14T17:38:23.712+02:00 END RequestId: a5242265-c13d-4015-9b7d-2699f0b26efe

2022-02-14T17:38:23.712+02:00 REPORT RequestId: a5242265-c13d-4015-9b7d-2699f0b26efe Duration: 3004.51 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 23 MB

2022-02-14T17:38:23.712+02:00 2022-02-14T15:38:23.711Z a5242265-c13d-4015-9b7d-2699f0b26efe Task timed out after 3.00 seconds

Working:

2022-02-14T17:38:23.733+02:00 START RequestId: 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 Version: $LATEST

2022-02-14T17:38:23.740+02:00 [DEBUG] 2022-02-14T15:38:23.739Z 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 Starting new HTTPS connection (1): cognito-idp.us-east-1.amazonaws.com:443

2022-02-14T17:38:23.926+02:00 [DEBUG] 2022-02-14T15:38:23.926Z 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 https://cognito-idp.us-east-1.amazonaws.com:443 "GET /us-east-1_.../.well-known/jwks.json HTTP/1.1" 200 916

2022-02-14T17:38:23.942+02:00 [DEBUG] 2022-02-14T15:38:23.941Z 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 Got the key a2PUhJTqMTiNysvmY+RfUPARHESV35jOMXWXJ4mAa/A= in 0.20495343208312988 seconds

2022-02-14T17:38:23.960+02:00 [INFO] 2022-02-14T15:38:23.960Z 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 response {'principalId': '...', 'policyDocument': {...}} 2022-02-14T17:38:23.980+02:00 END RequestId: 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00

2022-02-14T17:38:23.980+02:00 REPORT RequestId: 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 Duration: 244.45 ms Billed Duration: 245 ms Memory Size: 128 MB Max Memory Used: 55 MB Init Duration: 447.66 ms

2022-02-14T17:38:24.149+02:00 START RequestId: 1cca0b7a-0fa4-477d-9ddd-95d97db113b2 Version: $LATEST

2022-02-14T17:38:24.154+02:00 [DEBUG] 2022-02-14T15:38:24.154Z 1cca0b7a-0fa4-477d-9ddd-95d97db113b2 Got the cached key a2PUhJTqMTiNysvmY+RfUPARHESV35jOMXWXJ4mAa/A=

2022-02-14T17:38:24.155+02:00 [INFO] 2022-02-14T15:38:24.155Z 1cca0b7a-0fa4-477d-9ddd-95d97db113b2 response {'principalId': '...', 'policyDocument': {...}}

2022-02-14T17:38:24.156+02:00 END RequestId: 1cca0b7a-0fa4-477d-9ddd-95d97db113b2

2022-02-14T17:38:24.156+02:00 END RequestId: 1cca0b7a-0fa4-477d-9ddd-95d97db113b2

2022-02-14T17:38:24.156+02:00 REPORT RequestId: 1cca0b7a-0fa4-477d-9ddd-95d97db113b2 Duration: 2.64 ms Billed Duration: 3 ms Memory Size: 128 MB Max Memory Used: 55 MB

  • Nothing firm to offer: But on on the off chance that there is a memory issue here - have you tried increasing the memory to (say) 512MB and see if it fails then? I've seen Python Lambda functions try to allocate memory and fail (silently) but Lambda doesn't show that it is using all of the memory. Complete guess but trying to read the runes.

1 Answer
0

Hi Alexei,

Can you clarify what is printing this line:

2022-02-14T17:38:23.740+02:00 DEBUG 2022-02-14T15:38:23.739Z 2ea5db18-c9b5-4df8-b3ef-dfc01f9ede00 Starting new HTTPS connection (1): cognito-idp.us-east-1.amazonaws.com:443

It seems that a Connection is being made to the cognito endpoint here and nothing else is printed until Lambda times out.

This can happen when Lambda is unable to successfully connect to an external endpoint before the Lambda timeout occurs. From what I can see in the logs, Lambda is trying to connect to the Cognito endpoint but before it is able to make a successful connection or receive a reponse, the Lambda timeout is reached. In most cases, if there is no response within a few seconds then it would best to retry the HTTPS request.

It will depend on the timeout settings and the retry behavior of the library you are using if you will see an error in your Lambda logs or if you will see the retry in your Lambda logs

For example, the Boto3 SDK has a Connection timeout of 60 seconds. Therefore, if it tries to connect to Cognito then it will only timeout after 60 seconds. If your Lambda function has a timeout of 3 or 30 seconds it will appear as if your Lambda has "frozen" when it fails to connect to the Cognito endpoint until such time that the Lambda times out.

Please have a look here in this knowledge center article for more information about this -> https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-retry-timeout-sdk/

You may need to change the connection timeout of the library you are using to reach the cognito endpoint.

If the issue occurs consistently, you can update your timeout settings temporarily with the following formula(found in the above article):

First attempt (connection timeout + socket timeout) + Number of retries x (connection timeout + socket timeout) + 20 seconds additional code runtime margin = Required Lambda function timeout

This should ideally show that the HTTPS connection is being retried.

It's definitely possible that there may be another issues but this is what I'd like to rule out first based on the Lambda logs that have been provided.

AWS
SUPPORT ENGINEER
Ryan_A
answered 2 years ago
  • I see at the same time other instances access the same Cognito endpoint at 200 milisecond max, so the endpoint isn't down. I can't increase this Lambda timeout, it's an authorizer ffs, it can't ask other API to wait for 60 seconds until it figures out to grant or deny access. If it fails, at least it should fail quick. I'm not invoking it with boto, it's invoked by the API gateway, is there a way to modify the network timeouts and retries for this case?

  • Hi,

    Aplogies for the delay. To clarify, I'm not saying that the endpoint is down - that will rarely be the case. The issue here is the client that is trying to connect to the endpoint. A client can experience intermittent/transient issues which may prevent it from being able to successfully invoke an external endpoint. This can happen to a small percentage of Lambda invocations although it can definitely happen to other clients as well.

    That is correct and what I am getting at -> the request that Lambda is making needs to fail fast in cases like this where it is unable to reach the endpoint. You will need to check the library which you are using to invoke the external endpoint and change the timeout so that if fails fast instead of waiting for the default timeout which seems to be greater than your Lambda timoeut. This will need to be configured in the client library. The knowledge center article shows how it can be done for boto3 but for example here is how you can set the timeout in the requests library -> https://docs.python-requests.org/en/master/user/advanced/#timeouts

    Apologies if the above formula caused confusion, as mentioned, I meant to update the timeout settings temporarily so that you can see the HTTP request being retried to confirm if that is indeed the issue.

    Can you clarify how exactly the cognito endpoint is being invoked here? What is printing this line Starting new HTTPS connection (1): cognito-idp.us-east-1.amazonaws.com:443

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions