Intermittent ConnectTimeoutError accessing SSM

My app uses SSM Parameter Store on Fargate instances and locally in a Docker container. We're accessing it with Boto3 from Python. Multiple developers on my team, in different countries, have seen a very intermittent issue, cropping up maybe once every 1–4 weeks, where for 10 minutes or so, calls to SSM will fail with this error:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://ssm.us-east-2.amazonaws.com/"

As far as I'm aware, the ECS instances do not see the issue; it is only a problem when we're accessing the endpoint via Boto3 from our home networks. It occurs to me now that I haven't verified whether all users see the problem at the same time or just one user at a time. I will try to test this the next time I see it.

I have tried:

  1. Reducing the number of calls we make to SSM. It's now down to about 2/sec per user at the maximum, with effectively no other users concurrently hitting the API, so we're never getting anywhere near the 40 requests/second limit. Looking at the logs, the most I can see is 12 requests in one minute. We're just not using this very aggressively, so it doesn't seem possible that the problem is throttling. All of our calls are paginated calls to GetParametersByPath, and we are using WithDecryption=true.
  2. Changing the Boto3 retry mode from legacy to standard. This is probably a good thing to do anyway, but it doesn't seem to have fixed the problem.
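
For reference, the paginated GetParametersByPath pattern described in item 1 could look roughly like the sketch below. It takes an already-built SSM client, so it says nothing about how the client is configured. The function name `load_parameters` and the use of `Recursive=True` are assumptions for illustration, not taken from the actual application.

```python
from typing import Any, Dict


def load_parameters(ssm_client: Any, path: str) -> Dict[str, str]:
    """Fetch every parameter under `path` once and return a name -> value dict."""
    # boto3 provides a paginator for GetParametersByPath that follows
    # NextToken automatically, issuing one API call per page.
    paginator = ssm_client.get_paginator("get_parameters_by_path")
    values: Dict[str, str] = {}
    pages = paginator.paginate(
        Path=path,
        Recursive=True,       # assumption: parameters may live in nested paths
        WithDecryption=True,  # matches the WithDecryption=true usage described above
    )
    for page in pages:
        for param in page["Parameters"]:
            values[param["Name"]] = param["Value"]
    return values
```

Called once at process start, for example as `load_parameters(boto3.client("ssm"), "/myapp/")` with a hypothetical path, the returned dict can be kept in memory so later lookups never touch the network.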

The only reliable solution I've come up with is to wait. Eventually, the endpoint comes back and my application begins working again. But this is really an unacceptable level of service interruption, and I feel like I must be doing something wrong.

Is there a setting I have overlooked? Does anyone have any troubleshooting suggestions for things to try when I inevitably see the problem again?

1 Answer

The error you are getting is not throttling. Throttling would throw a ThrottlingException or a Rate Exceeded error, not a connect timeout. A connect timeout looks network related. Are you using a proxy server or bastion to connect to the SSM endpoint that might be getting overloaded? That would also explain why the issue only occurs on home networks: if it were throttling, the ECS tasks would experience the same thing.
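
One way to check the network theory the next time the error appears is to bypass Boto3 and attempt a plain TCP connection to each address the endpoint resolves to. The sketch below uses only the standard library; `probe_endpoint` is a hypothetical helper name, and the 3-second default simply mirrors a typical short connect timeout.

```python
import socket
import time
from typing import List, Tuple


def probe_endpoint(
    host: str, port: int = 443, timeout: float = 3.0
) -> List[Tuple[str, bool, float]]:
    """Try a plain TCP connect to every address `host` resolves to.

    Returns one (ip, connected, seconds) tuple per resolved address.
    Raises socket.gaierror if DNS resolution itself fails.
    """
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        start = time.monotonic()
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            connected = True
        except OSError:  # covers timeouts, refusals, and unreachable networks
            connected = False
        finally:
            sock.close()
        results.append((sockaddr[0], connected, time.monotonic() - start))
    return results
```

Running `probe_endpoint("ssm.us-east-2.amazonaws.com")` during an outage and comparing the output across affected and unaffected team members should show whether DNS fails, whether only some resolved addresses time out, or whether every connection attempt from that network fails.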

Even though I don't believe the issue is throttling, you should keep doing both of the things you have tried, to avoid throttling issues in the future.
Reducing the number of calls: additionally, evaluate which parameters can be loaded once when the container starts, as opposed to being loaded multiple times.
Set the retry and backoff parameters that make sense for your application, e.g. how long you want retries to continue before an error is thrown.
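
If the client's built-in retry settings do not give enough control over how long retries continue, one option is an application-level wrapper around the call. The sketch below is generic and uses only the standard library; `call_with_backoff` and its defaults are hypothetical, and it would run in addition to whatever retries Boto3 already performs internally.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on `retry_on` with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except retry_on:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the original error
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

For the error in the question, `retry_on` could be `(botocore.exceptions.ConnectTimeoutError,)`, with `func` being a zero-argument callable that performs the SSM call. The `sleep` parameter exists mainly so the delay can be replaced in tests.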

EXPERT
answered 16 days ago
  • Thank you for the suggestions! To answer your questions:

    1. I am not using a bastion/proxy server to access SSM. Good point about ECS and throttling.
    2. As part of debugging this, I dramatically reduced the SSM calls from ~12/14 per second on process start to 2. The values are stored at process launch and aren't fetched again.
    3. Right now, I'm just using the standard values:
            # Assumes: from botocore.config import Config as BotoConfig
            config = BotoConfig(connect_timeout=3, retries={"mode": "standard"})
            self._client = boto3.client("ssm", config=config)
    

    The only options I see in the retries dictionary are mode and max_attempts, nothing about backoff. Is there another place to configure that?

    I found this boto3 bug on GitHub that suggested boto3.set_stream_logger('') as a way to get verbose logging of what's happening with connections and retries, so I've added that into my application. It really is very verbose, so I'm hoping I don't have to leave it on very long, LOL!
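
A less noisy alternative to logging everything might be to raise the level only on the loggers that deal with retries and connections, using the standard `logging` module. The logger names below are assumptions based on the module layout of botocore and urllib3 and may differ between versions; `enable_connection_logging` is a hypothetical helper.

```python
import logging
from typing import Iterable

# Assumed logger names; verify against the names that appear in the
# verbose output of the botocore/urllib3 versions actually installed.
DEFAULT_LOGGERS = (
    "botocore.retries",        # standard-mode retry decisions
    "botocore.retryhandler",   # legacy-mode retry decisions
    "botocore.httpsession",    # HTTP session / connection errors
    "urllib3.connectionpool",  # new connections and connection resets
)


def enable_connection_logging(
    names: Iterable[str] = DEFAULT_LOGGERS, level: int = logging.DEBUG
) -> logging.Handler:
    """Attach one stream handler to the named loggers and set their level."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    for name in names:
        logger = logging.getLogger(name)
        logger.setLevel(level)  # child loggers inherit this level
        logger.addHandler(handler)
    return handler
```

Because only these loggers are switched to DEBUG, the rest of the application's logging stays at its normal level, which may make it practical to leave this enabled until the problem recurs.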
