Still getting Intermittent ConnectTimeoutError when accessing SSM in Docker

A few months ago, I posted about a timeout error I'm getting from the SSM service, and I'm still seeing the problem. Here's the original thread. It's intermittent but frustratingly common during development (often every 30–120 min), when my Flask server restarts constantly and hits the SSM endpoint every time I press Save (so maybe bursts of 5–20 hits a minute).

But this is FAR below the endpoint's limit, and I'm getting a timeout error, not a throttling error. To my knowledge, I haven't seen the error when the container is running on ECS; it's only an issue in local development. But ~5 people on my team are seeing the same behavior when using different ISPs in multiple countries. The SSM host is reachable from outside the container, but once the issue appears, the container is unable to access this host (and JUST this host) for 5–10 minutes. Connections to other URLs work fine. Restarting the container doesn't help. Using a VPN to change my IP doesn't help.

I'm at my wit's end, and it's really impeding local development at this point. I'm at a loss for how the problem is only affecting SSM; it really seems like it must be something to do with either Docker or AWS. I've checked CloudTrail: I see the GetParametersByPath events when things are working normally, but nothing when the connection is failing.

I really need a solution to this. Can anyone suggest other things to try?
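
One way to narrow this down is to check, from inside the stuck container, whether the failure happens at DNS resolution or at the TCP connect. The following is only a minimal sketch using the Python standard library. The function name probe_host, the default port 443, and the 5-second timeout are illustrative assumptions, not anything from the original setup.

```python
import socket
import time


def probe_host(host, port=443, timeout=5.0):
    """Resolve `host`, then try a plain TCP connect to each resolved address.

    Hypothetical helper: returns a list of dicts describing each attempt,
    so a DNS failure can be told apart from a connect timeout.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution itself failed; no address to connect to.
        return [{"stage": "dns", "ok": False, "error": str(exc)}]

    results = []
    seen = set()
    for family, socktype, proto, _canonname, sockaddr in infos:
        if sockaddr in seen:
            continue  # skip duplicate addresses
        seen.add(sockaddr)
        start = time.monotonic()
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            ok, error = True, None
        except OSError as exc:  # includes timeouts and refused connections
            ok, error = False, str(exc)
        results.append({
            "stage": "tcp",
            "address": sockaddr[0],
            "ok": ok,
            "error": error,
            "seconds": round(time.monotonic() - start, 3),
        })
    return results


# Example call (the region in the hostname is an assumption):
# probe_host("ssm.us-east-2.amazonaws.com")
```

Running the same call inside and outside the container while the hang is happening would show whether both resolve to the same addresses, and whether only the container's TCP connects time out.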

UPDATE 2022-11-04: I have noticed a similar issue with Cognito. One of the first things my app does upon launching and receiving a new request is to try to connect to Cognito to fetch the JWKS in order to verify the JWT. From outside the container, this succeeds in about a second:

$ curl https://cognito-idp.us-east-2.amazonaws.com/us-east-2_xxxxxxxx/.well-known/jwks.json
{"keys":[{"alg":"RS256","e":"AQAB","kid":" […]

But when I open a shell into the container that's stuck waiting for Cognito, Python can't reach the host:

>>> import requests
>>> requests.get("https://apple.com") # Returns instantly
<Response [200]>
>>> requests.get("https://cognito-idp.ap-southeast-1.amazonaws.com/ap-southeast-1_ie577myCv/.well-known/jwks.json")  # Returns relatively quickly
<Response [404]>
>>> requests.get("https://cognito-idp.us-east-2.amazonaws.com/us-east-2_xxxxxxxx/.well-known/jwks.json")  # Hangs for a long time

Eventually, the host comes back. But WHY is it reachable with no problems from outside the container?? I'm caching the JWKS, so I only need to send the request once, when the first request arrives. But that's enough to bring my app to a standstill…
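
For reference, the fetch-once-and-cache pattern described above can be combined with an explicit timeout, so that a hang surfaces as an exception instead of blocking the app indefinitely. This is a minimal sketch, not the app's actual code. It uses the standard library's urllib rather than requests, and the names get_jwks and _jwks_cache and the 5-second default are assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical in-process cache: URL -> parsed JWKS document.
_jwks_cache = {}


def get_jwks(url, timeout=5.0):
    """Return the JWKS at `url`, fetching it at most once per process.

    `timeout` bounds blocking socket operations such as the connect,
    so an unreachable host raises an error instead of hanging.
    """
    if url not in _jwks_cache:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            _jwks_cache[url] = json.load(resp)
    return _jwks_cache[url]
```

This does not fix the underlying connectivity problem. It only makes the failure quick and visible, and a failed fetch is not cached, so the next request retries.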

1 Answer
Accepted Answer

This is a Docker issue. If you're experiencing it, you can download an older version of Docker Desktop that doesn't have the problem. I've been running version 4.5.0 for 9 days now, and the connection hangs haven't recurred in that time.

nk9
answered a year ago
