Is there a way to add Exponential Backoff to aws:executeAutomation in an SSM Automation Document?

Put simply, I have written an SSM Automation runbook. It creates a snapshot of each volume attached to a targeted instance, using a list of volume IDs returned by a DescribeInstance step. It then uses a rate execution of a second runbook, in an aws:executeAutomation action, to create each snapshot, fanning out over those volume IDs. As you can see, I have already limited MaxConcurrency on this step to 1.

{
  "name": "SnapshotAllVolumes",
  "action": "aws:executeAutomation",
  "maxAttempts": 3,
  "onFailure": "Abort",
  "inputs": {
    "DocumentName": "MyCustomCreateSnapshotRunbook",
    "Targets": [
      {
        "Key": "ParameterValues",
        "Values": [
          "{{ DescribeInstance.CurrentVolumes }}"
        ]
      }
    ],
    "TargetParameterName": "VolumeId",
    "RuntimeParameters": {
      "InstanceId": "{{ InstanceId }}",
      "InstanceName": "{{ GetInstanceName.Name }}"
    },
    "MaxConcurrency": "1"
  }
}

Where I have gotten into trouble is that I am running this runbook from a Maintenance Window against ~60 instances, each averaging about 3 EBS volumes. I keep hitting a rate limit on the step above.

Step fails when it is Executing. Fail to start automation, errorMessage: Rate exceeded. Please refer to Automation Service Troubleshooting Guide for more diagnosis details.

Unfortunately, the error doesn't tell me exactly which rate limit I'm hitting, but I think I can assume it is one of two: either the limit on API calls, or the limit on simultaneous SSM rate executions. Because the latter is much more restrictive (25 is the max), it's my most likely suspect. I've been dialing down the concurrency limit on my Maintenance Window. If my assumption about the rate limit is correct, I need to stay under 25 concurrent rate executions.

With a max concurrency of 1 on the child runbook, no individual execution of the parent runbook should lead to more than 2 simultaneous rate executions (the parent, plus one child execution at a time, run consecutively once per volume). This means the concurrency for my Maintenance Window needs to stay below 25/2, so I've reckoned my max safe concurrency is a mere 12.
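The arithmetic above can be sketched as follows. This is only an illustration of the reasoning in the question: the limit of 25 and the figure of 2 executions per parent are the asker's assumptions, and the function name is made up for the example.

```python
# Figures taken from the question; both are assumptions, not confirmed quotas.
RATE_EXECUTION_LIMIT = 25   # assumed cap on simultaneous rate executions
EXECUTIONS_PER_PARENT = 2   # the parent plus one child at a time (MaxConcurrency = 1)


def max_safe_concurrency(limit: int, per_parent: int) -> int:
    """Largest number of parent runbooks that can run at once
    without the total number of executions exceeding the limit."""
    return limit // per_parent


print(max_safe_concurrency(RATE_EXECUTION_LIMIT, EXECUTIONS_PER_PARENT))  # 12
```

With 12 parents running, the total is 24 simultaneous executions, which stays under the assumed limit of 25; a 13th parent would push it to 26.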

When I talked all this over with support, the recommended solution was to implement some sort of exponential backoff on the retries here. That way, the calls that are rate limited on the first attempt are retried at different times rather than all at once. This could be done in code if I resorted to invoking a Lambda that executed the child Automation, instead of executing it directly in the parent runbook.
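For reference, the kind of retry logic support described might look like the sketch below. It is a generic illustration, not code from support or from the runbook: `RateExceeded`, `call_with_backoff`, and all parameter names are hypothetical, and the `start` callable stands in for whatever call actually launches the child Automation.

```python
import random
import time


class RateExceeded(Exception):
    """Hypothetical stand-in for the throttling error raised by the real call."""


def call_with_backoff(start, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call start() and retry on RateExceeded with exponential backoff.

    The wait before retry n is a random fraction ("full jitter") of
    min(max_delay, base_delay * 2**n), so throttled callers spread out
    instead of retrying in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return start()
        except RateExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters are injectable only so the behaviour can be checked without real waiting; in normal use the defaults apply.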

I'd much rather not introduce a dependency on a Lambda here. Does anyone have a way to implement incremental backoff purely in SSM Automation? Or is it simply not possible?

asked 2 years ago, 583 views
1 Answer
Hello, thank you for your post. Incremental backoff cannot be enabled directly in SSM Automation. The closest option is setting an error threshold on the Run Command page in the SSM console.
From the information you provided, I suspect your document named MyCustomCreateSnapshotRunbook uses the CreateSnapshot or CreateSnapshots API, and that this is the API for which you are hitting a rate limit. I recommend using the CreateSnapshots API, as it takes an instance ID as input and creates snapshots for all volumes attached to the instance [1]. This makes a rate limit less likely than calling the CreateSnapshot API repeatedly for individual volume IDs.
As the issue you are facing involves details that are specific to your use case and the resources in your account, I encourage you to open a support case so that our support team can review the API calls that were made and the specific errors that were encountered. This will allow us to provide you with better guidance to resolve the issue.
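As a rough illustration of the CreateSnapshots suggestion, the sketch below makes a single call per instance instead of one call per volume. It assumes a boto3 EC2 client is passed in; the function name, the description text, and the choice to include the boot volume and copy tags from the volumes are assumptions made for the example, not details from the original runbook.

```python
def snapshot_instance_volumes(ec2_client, instance_id,
                              description="Created by SSM Automation"):
    """Snapshot every volume attached to an instance with one API call.

    ec2_client is expected to be a boto3 EC2 client, for example
    boto3.client("ec2"). Returns the IDs of the snapshots that were started.
    """
    response = ec2_client.create_snapshots(
        InstanceSpecification={
            "InstanceId": instance_id,
            "ExcludeBootVolume": False,  # assumption: the root volume is wanted too
        },
        Description=description,
        CopyTagsFromSource="volume",     # assumption: carry volume tags over
    )
    return [snap["SnapshotId"] for snap in response["Snapshots"]]
```

For an instance with about 3 volumes, this replaces about 3 CreateSnapshot calls with one request, which is the reduction the answer is pointing at.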

References:
[1] https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateSnapshots.html

AWS
SUPPORT ENGINEER
answered 2 years ago
