Intermittent ConnectTimeoutError accessing SSM

My app uses SSM Parameter Store on Fargate instances and locally in a Docker container. We're accessing it with Boto3 from Python. Multiple developers on my team, in different countries, have seen a very intermittent issue, cropping up maybe once every 1–4 weeks, where for 10 minutes or so, calls to SSM will fail with this error:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://ssm.us-east-2.amazonaws.com/"

As far as I'm aware, the ECS instances do not see the issue; this is only a problem when we're accessing the endpoint via Boto3 from our home networks. It occurs to me now that I haven't verified whether all users see the problem at the same time or just one user at a time. I will try to test this the next time I see it.

I have tried:

  1. Reducing the number of calls we make to SSM. It's now down to about 2/sec per user at the maximum, with effectively no other users concurrently hitting the API, so we're never getting anywhere near the 40 requests/second limit. Looking at the logs, the most I can see is 12 requests in one minute. We're just not using this very aggressively, so it doesn't seem possible that the problem is throttling. All of our calls are paginated calls to GetParametersByPath, and we are using WithDecryption=true.
  2. Changing the Boto3 retry method from Legacy to Standard. This is probably a good thing to do anyway, but doesn't seem to have fixed the problem.
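
For reference, here is a minimal sketch of the call pattern described in item 1: one paginated pass over GetParametersByPath with decryption enabled. The function name, the `/myapp/` path, and `Recursive=True` are assumptions for illustration, not our exact code. The client is passed in so the function itself does not depend on how it was configured.

```python
from typing import Any, Dict


def load_parameters(client: Any, path: str) -> Dict[str, str]:
    """Fetch every parameter under `path` in one paginated pass.

    `client` is expected to be a boto3 SSM client, e.g. boto3.client("ssm").
    """
    paginator = client.get_paginator("get_parameters_by_path")
    values: Dict[str, str] = {}
    # Each page corresponds to one GetParametersByPath request, so the number
    # of API calls is the number of pages, not the number of parameters.
    for page in paginator.paginate(Path=path, Recursive=True, WithDecryption=True):
        for param in page["Parameters"]:
            values[param["Name"]] = param["Value"]
    return values


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   params = load_parameters(boto3.client("ssm"), "/myapp/")
```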

The only reliable solution I've come up with is to wait. Eventually, the endpoint comes back and my application begins working again. But this is really an unacceptable level of service interruption, and I feel like I must be doing something wrong.

Is there a setting I have overlooked? Does anyone have any troubleshooting suggestions for things to try when I inevitably see the problem again?


Update (2022-10-11): I am still experiencing this. I have added a connectivity test which fetches 4 URLs. The first 3 always succeed. The 4th only succeeds sometimes:

https://www.apple.com status_code=200 len=104235 0.044803sec
https://cognito-idp.us-east-2.amazonaws.com status_code=400 len=113 0.378667sec
https://xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com/ status_code=401 len=15 0.385089sec
https://ssm.us-east-2.amazonaws.com status_code=404 len=29 0.384169sec

(Don't be confused by the 40x status codes. Those are just because I haven't sent a real, authenticated request. The key thing is that I received a timely response.)

This same request fails other times:

requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ssm.us-east-2.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0xffff8e6af550>, 'Connection to ssm.us-east-2.amazonaws.com timed out. (connect timeout=3)'))

I set the timeout to 3 seconds here, but it has also timed out when I let the connection wait for over 2 minutes. This is a direct HTTPS fetch with requests, so I'm not even using boto3.
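
A possible next step, sketched here with only the standard library, is to take HTTP out of the picture entirely and try a plain TCP connect to every address the hostname resolves to. The function name `probe_host` and its defaults are made up for this example. It is a diagnostic sketch, not a fix.

```python
import socket
import time
from typing import List, Tuple


def probe_host(host: str, port: int = 443, timeout: float = 3.0) -> List[Tuple[str, bool, float]]:
    """Try a TCP connect to each address `host` resolves to.

    Returns one (ip, connected, seconds) tuple per distinct address.
    A DNS failure is not caught here and raises socket.gaierror.
    """
    results = []
    seen = set()
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        ip = sockaddr[0]
        if ip in seen:
            continue
        seen.add(ip)
        start = time.monotonic()
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            connected = True
        except OSError:
            # Covers timeouts, refused connections, and unreachable networks.
            connected = False
        results.append((ip, connected, time.monotonic() - start))
    return results


# Hypothetical usage during an outage:
#   for ip, ok, secs in probe_host("ssm.us-east-2.amazonaws.com"):
#       print(ip, "ok" if ok else "FAILED", f"{secs:.3f}s")
```

Running this against the SSM host and a host that still works would show whether the timeout hits every resolved address or only some of them, and whether DNS resolution itself still succeeds.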

What would make only the SSM host become unreachable so often? I don't see how this could be an issue with my Docker container if other URLs work just fine.

3 Answers
Accepted Answer

This turned out to be a Docker Desktop issue. You can work around it by using an older version of Docker Desktop, 4.5.0 (Mac) or 4.5.1 (Win).

nk9
answered a year ago
EXPERT
reviewed 3 months ago

The error you are getting is not throttling. Throttling would raise a ThrottlingException or a Rate Exceeded error, not a connect timeout. A connect timeout looks network-related. Are you using a proxy server or bastion to connect to the SSM endpoint that might be getting overloaded? This also explains why the issue only occurs on home networks: if it were throttling, the ECS tasks would experience the same thing.

Even though I don't believe the issue is throttling, you should keep doing both of the things you have tried, to avoid throttling issues in the future.
Reducing the number of calls - additionally, evaluate which parameters can be loaded once when the container starts, as opposed to being loaded multiple times.
Setting the retry and backoff parameters that make sense for your application, e.g. how long you want retries to continue before an error is thrown.
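
The two suggestions above can be sketched together as follows. This is a generic standard-library wrapper, not botocore's built-in retry logic: `call_with_backoff`, `ParameterCache`, and all default values are hypothetical names and numbers chosen for illustration. It retries a fetch with exponential backoff and jitter until an attempt limit or a time budget is reached, and caches the result so parameters are loaded only once.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple, Type


def call_with_backoff(
    fetch: Callable[[], Dict[str, str]],
    max_attempts: int = 5,
    deadline_seconds: float = 30.0,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch() until it succeeds, attempts run out, or time runs out.

    In real code, narrow `retry_on` to the errors worth retrying, for example
    botocore.exceptions.ConnectTimeoutError.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            # Jittered exponential backoff: wait a random time up to
            # base_delay * 2^(attempt - 1) seconds.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            out_of_time = time.monotonic() - start + delay > deadline_seconds
            if attempt == max_attempts or out_of_time:
                raise  # give up and surface the last error
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


class ParameterCache:
    """Load parameters once (e.g. at container start) and reuse them."""

    def __init__(self, loader: Callable[[], Dict[str, str]]) -> None:
        self._loader = loader
        self._values: Optional[Dict[str, str]] = None

    def get(self) -> Dict[str, str]:
        if self._values is None:
            self._values = self._loader()
        return self._values
```

The `deadline_seconds` argument is what answers "how long do you want retries to occur": the wrapper stops as soon as the next wait would exceed that budget, even if attempts remain.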

AWS
EXPERT
Peter_G
answered 2 years ago
  • Thank you for the suggestions! To answer your questions:

    1. I am not using a bastion/proxy server to access SSM. Good point about ECS and throttling.
    2. As part of debugging this, I dramatically reduced the SSM calls from ~12/14 per second on process start to 2. The values are stored at process launch and aren't fetched again.
    3. Right now, I'm just using the standard values:
            config = BotoConfig(connect_timeout=3, retries={"mode": "standard"})
            self._client = boto3.client("ssm", config=config)
    

    The only options I see in the retries dictionary are mode and max_attempts, nothing about backoff. Is there another place to configure that?

    I found this boto3 bug on GitHub that suggested boto3.set_stream_logger('') as a way to get verbose logging of what's happening with connections and retries, so I've added that into my application. It really is very verbose, so I'm hoping I don't have to leave it on very long, LOL!

  • Would appreciate your thoughts given the new info added to the question, @Peter_G.
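
On the logging point above: boto3.set_stream_logger('') attaches a handler to the root logger, which is why the output is so verbose. A narrower sketch using only the standard logging module is shown below. The logger names are taken from the module layout of botocore and urllib3 and are an assumption that may not hold for every version; the function name is made up for this example.

```python
import logging
import sys
from typing import Iterable

# Assumed logger names for retry and connection events. They follow the
# module paths of botocore and urllib3 and may differ between versions.
CONNECTION_LOGGERS = (
    "botocore.retries",
    "botocore.retryhandler",
    "botocore.httpsession",
    "urllib3.connectionpool",
)


def enable_connection_logging(
    names: Iterable[str] = CONNECTION_LOGGERS, level: int = logging.DEBUG
) -> logging.Handler:
    """Send debug output for the named loggers (and their children) to stderr."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s [%(levelname)s] %(message)s")
    )
    for name in names:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        logger.addHandler(handler)
    return handler
```

Calling `enable_connection_logging()` once at startup should be quiet enough to leave on between incidents. If the root logger also has a handler, messages will appear twice because records still propagate upward.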


I had a similar issue with SSM timing out when running within a Lambda (specifically in AWS GovCloud, when trying to access a Parameter Store entry that didn't exist). Modifying the boto config to increase the number of retries and change the retry mode appears to have resolved what was a consistent issue, and did not lead to a notably longer runtime.

import boto3
from botocore.config import Config

# Allow up to 10 attempts and use the adaptive retry mode.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
region = "us-gov-west-1"
ssm = boto3.client(service_name="ssm", region_name=region, config=config)
parameter_response = ssm.get_parameter(Name="redacted")
JS
answered 8 months ago
