A few months ago, I posted about a timeout error I'm getting from the SSM service. I'm still seeing the problem. Here's the original thread. It's intermittent, but frustratingly common (often every 30–120 min) during development when my Flask server is restarting all the time and thus hitting the SSM endpoint as often as I press Save (so maybe bursts of 5–20 hits a minute).
But this is FAR below the endpoint's limit, and I'm getting a timeout error, not a throttling error. To my knowledge, I haven't seen the error when the container is running on ECS; it's only an issue in local development. But ~5 people on my team are seeing the same behavior on different ISPs in multiple countries. The SSM host is reachable from outside the container, but once the issue appears, the container is unable to reach this host (and JUST this host) for 5–10 minutes. Connections to other URLs work fine. Restarting the container doesn't help. Using a VPN to change my IP doesn't help.
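To pin down which stage fails while the container cannot reach the host, a check along the following lines could separate DNS resolution, the TCP connect, and the TLS handshake. This is only a sketch using the Python standard library; `probe` is a hypothetical helper and is not part of the app described here.

```python
import socket
import ssl
import time


def probe(host, port=443, timeout=5.0):
    """Check DNS, TCP, and TLS in turn for one host.

    Returns a list of (stage, ok, seconds, detail) tuples and stops at the
    first stage that fails. `probe` is a hypothetical helper for illustration.
    """
    results = []

    # Stage 1: DNS lookup. getaddrinfo has no timeout parameter of its own.
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        results.append(("dns", False, time.monotonic() - start, repr(exc)))
        return results
    addr = infos[0][4][0]  # only the first resolved address is tried
    results.append(("dns", True, time.monotonic() - start, addr))

    # Stage 2: plain TCP connect to the resolved address.
    start = time.monotonic()
    try:
        sock = socket.create_connection((addr, port), timeout=timeout)
    except OSError as exc:
        results.append(("tcp", False, time.monotonic() - start, repr(exc)))
        return results
    results.append(("tcp", True, time.monotonic() - start, addr))

    # Stage 3: TLS handshake. The wrapped socket inherits the same timeout.
    start = time.monotonic()
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            results.append(("tls", True, time.monotonic() - start, tls.version()))
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        results.append(("tls", False, time.monotonic() - start, repr(exc)))
    finally:
        sock.close()
    return results
```

Running `probe(...)` against the regional SSM hostname (for example `ssm.us-east-2.amazonaws.com`, assuming the parameters live in us-east-2) from inside and outside the container at the same moment would show whether the two environments resolve to different addresses and at which stage the container stalls.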
I'm at my wit's end, and it's really impeding local development at this point. I'm at a loss as to how the problem is only affecting SSM; it really seems like it must be something to do with either Docker or AWS. I've looked, and I see the GetParametersByPath events in CloudTrail when things are working normally, but nothing when the connection is failing.
I really need a solution to this. Can anyone suggest other things to try?
UPDATE 2022-11-04: I have noticed a similar issue with Cognito. One of the first things my app does upon launching and receiving a new request is to try to connect to Cognito to fetch the JWKS in order to verify the JWT. From outside the container, this succeeds in about a second:
$ curl https://cognito-idp.us-east-2.amazonaws.com/us-east-2_xxxxxxxx/.well-known/jwks.json
{"keys":[{"alg":"RS256","e":"AQAB","kid":" […]
But when I open a shell into the container that's stuck waiting for Cognito, Python can't reach the host:
>>> import requests
>>> requests.get("https://apple.com") # Returns instantly
<Response [200]>
>>> requests.get("https://cognito-idp.ap-southeast-1.amazonaws.com/ap-southeast-1_ie577myCv/.well-known/jwks.json") # Returns relatively quickly
<Response [404]>
>>> requests.get("https://cognito-idp.us-east-2.amazonaws.com/us-east-2_xxxxxxxx/.well-known/jwks.json") # Hangs for a long time
Eventually, the host comes back. But WHY is it reachable with no problems from outside the container?? I'm caching the JWKS, so I only need to send the request once, when the first request comes in. But that's enough to bring my app to a standstill…