If the error were related to the number of concurrent training jobs (either overall in your AWS account and region, or for a specific training instance type), it would usually present as a ResourceLimitExceeded exception. The resolution would be to raise a request in the Service Quotas console to increase the relevant quota, e.g. "ml.XYZ.ABC for training job usage" or "Number of instances across all training jobs".
Since you're seeing a ThrottlingException, it sounds instead like you're running into the "Rate of CreateTrainingJob requests" quota, which limits the rate at which new training jobs can be requested (1 TPS by default, I believe). You can implement a retry with backoff in your state machine to automatically retry the step on failure. There's no need to wait for the job to finish: a second or so should suffice if you only have the 2 calls active, but best practice would be to allow a few retries in case other users or workflows are also creating jobs in the account.
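As a rough sketch of what that retry could look like, the Python below builds the `Retry` block you would attach to the training-job Task state in the state machine definition, and prints the waits it implies. `ErrorEquals`, `IntervalSeconds`, `MaxAttempts` and `BackoffRate` are standard Amazon States Language retrier fields. The helper names and the error name `SageMaker.ThrottlingException` are my assumptions for illustration: check the failed execution's history for the exact error name your step reports and use that instead.

```python
import json


def build_retry_policy(error_names, interval_seconds=2, max_attempts=5, backoff_rate=2.0):
    """Return a Retry list for a Step Functions Task state (hypothetical helper)."""
    return [
        {
            "ErrorEquals": list(error_names),     # error names this retrier matches
            "IntervalSeconds": interval_seconds,  # wait before the first retry
            "MaxAttempts": max_attempts,          # number of retries before giving up
            "BackoffRate": backoff_rate,          # multiplier applied to the wait each retry
        }
    ]


def retry_delays(retrier):
    """Waits (in seconds) before each retry, as implied by the retrier fields."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


# ASSUMPTION: the error name below is a guess; copy the real one from your execution history.
policy = build_retry_policy(["SageMaker.ThrottlingException"])

# Paste the printed "Retry" block into the Task state that creates the training job.
print(json.dumps({"Retry": policy}, indent=2))
print(retry_delays(policy[0]))
```

With these values the step would be retried up to five times, waiting 2, 4, 8, 16 and then 32 seconds, which is far more headroom than two parallel calls need but leaves room for other workflows creating jobs in the same account.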
It's also worth mentioning that SageMaker has its own native pipeline orchestration solution in SageMaker Pipelines, which seems (in my cases at least) to handle this without needing the retry. But while Step Functions has native integrations with a broad range of AWS services, SageMaker Pipelines is a bit more specialized towards SageMaker (although it does offer Lambda and callback-based steps, giving enough flexibility for wedging a range of other things in if needed).
