Put simply, I have written an SSM Automation runbook. It creates a snapshot of each volume attached to a targeted instance, getting the list of volume IDs from a DescribeInstance step. It then uses rate control to run a second runbook in an aws:executeAutomation action, fanning out on those volume IDs so that the child runbook creates each snapshot. As you can see, I have already limited MaxConcurrency of this step to 1.
{
    "name": "SnapshotAllVolumes",
    "action": "aws:executeAutomation",
    "maxAttempts": 3,
    "onFailure": "Abort",
    "inputs": {
        "DocumentName": "MyCustomCreateSnapshotRunbook",
        "Targets": [
            {
                "Key": "ParameterValues",
                "Values": [
                    "{{ DescribeInstance.CurrentVolumes }}"
                ]
            }
        ],
        "TargetParameterName": "VolumeId",
        "RuntimeParameters": {
            "InstanceId": "{{ InstanceId }}",
            "InstanceName": "{{ GetInstanceName.Name }}"
        },
        "MaxConcurrency": "1"
    }
}
Where I have gotten into trouble is that I am attempting to run this runbook via a Maintenance Window on ~60 instances, each averaging about 3 EBS volumes. I keep hitting a rate limit on the step above.
Step fails when it is Executing. Fail to start automation, errorMessage: Rate exceeded. Please refer to Automation Service Troubleshooting Guide for more diagnosis details.
Unfortunately, the error doesn't tell me exactly which rate limit I'm hitting, but I think I can assume it is one of two: either the limit on API calls, or the limit on simultaneous SSM rate control executions. Because the latter is much more restrictive (25 is the max), I think it's my most likely suspect. I've been dialing down the concurrency limit on my Maintenance Window. If my assumption about the rate limit is correct, I need to stay under 25 concurrent rate control executions.
With a max concurrency of 1 on the child runbook, no single execution of the parent runbook should account for more than 2 simultaneous rate control executions: the parent itself, plus one child execution at a time, since the per-volume executions run consecutively. This means the concurrency for my Maintenance Window needs to stay below 25/2, so I reckon my maximum safe concurrency is a mere 12.
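To make that arithmetic explicit, here is a minimal sketch of the budget calculation. The limit of 25 and the cost of 2 rate control executions per parent are my own assumptions from above, not confirmed numbers, and the function name is just for illustration.

```python
# Assumed quota on concurrent rate control executions (my guess at the limit being hit).
RATE_CONTROL_LIMIT = 25

# Assumed cost of one parent execution: the parent itself plus one child at a time,
# because MaxConcurrency is 1 on the aws:executeAutomation step.
EXECUTIONS_PER_PARENT = 2


def max_safe_window_concurrency(limit: int, per_parent: int) -> int:
    """Largest number of parents that can run at once without exceeding the limit."""
    return limit // per_parent


if __name__ == "__main__":
    print(max_safe_window_concurrency(RATE_CONTROL_LIMIT, EXECUTIONS_PER_PARENT))  # 12
```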
When I talked all this over with support, the recommended solution was to implement some sort of exponential backoff on the retries here. That way, the calls that are rate limited on the first attempt are retried at different times rather than all at once. This could be done in code if I resorted to invoking a Lambda that started the child Automation, instead of executing it directly in the parent runbook.
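For reference, this is roughly what I understand the Lambda approach to look like: a sketch of exponential backoff with jitter, not anything support gave me. RateExceeded and call_with_backoff are hypothetical names; in a real Lambda the start callable would wrap something like boto3's ssm.start_automation_execution and translate the throttling error into RateExceeded, and exactly which error code to catch is an assumption I have not verified.

```python
import random
import time


class RateExceeded(Exception):
    """Hypothetical stand-in for the 'Rate exceeded' throttling error."""


def call_with_backoff(start, max_attempts=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call start(), retrying on RateExceeded with exponential backoff and jitter.

    The wait before retry number n (0-based) is a random fraction of
    min(cap, base * 2**n) seconds, so throttled callers spread out over time.
    """
    for attempt in range(max_attempts):
        try:
            return start()
        except RateExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(rng() * min(cap, base * 2 ** attempt))


if __name__ == "__main__":
    # Fake 'start' that is throttled twice, then succeeds.
    state = {"calls": 0}

    def fake_start():
        state["calls"] += 1
        if state["calls"] < 3:
            raise RateExceeded("Rate exceeded")
        return "execution-started"

    print(call_with_backoff(fake_start, base=0.01))
```

The sleep and rng parameters are only there so the delays can be replaced in a test; the randomness is what keeps 60 instances from all retrying at the same moment.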
I'd much rather not introduce a dependency on a Lambda here. Does anyone have a way to implement incremental backoff purely in SSM Automation? Or is it simply not possible?