Rate exceeded (Service: AmazonSageMaker; Status Code: 400; Error Code: ThrottlingException;

I'm using Step Functions to run SageMaker training jobs.

Right now I have a Distributed Map state that kicks off these training jobs, but they end up running serially because of these throttling exceptions.

Even just 2 concurrent training jobs would significantly reduce our overall training time. Any ideas about how to achieve this?

1 Answer

If the error were related to the number of concurrent training jobs (either overall in your AWS account and Region, or for a specific training instance type), it would usually present as a ResourceLimitExceeded exception, and the resolution would be to raise a request in the Service Quotas console to increase your quotas for e.g. "ml.XYZ.ABC for training job usage" or "Number of instances across all training jobs".

Since you're seeing a ThrottlingException, it sounds instead like you're running into the "Rate of CreateTrainingJob requests" quota, which limits the rate at which new training jobs can be requested (1 TPS by default, I believe). You can implement a retry with backoff in your state machine to automatically retry the step on failure. There's no need to wait for the first job to finish: a delay of a second or so should suffice if you only have the 2 calls active, but best practice would be to allow a few retries in case other users or workflows are also creating jobs in the account.
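As a rough sketch of what that retry could look like, the Python below builds a Retry block for the Task state that calls CreateTrainingJob and prints it as JSON to paste into the state machine definition. The helper names (build_retry_policy, retry_delays), the interval and attempt values, and the error name "SageMaker.AmazonSageMakerException" are assumptions for illustration; check the error name shown in your own execution history before relying on it. The Parameters for the training job are deliberately left out.

```python
import json


def build_retry_policy(interval_seconds=2, max_attempts=5, backoff_rate=2.0):
    """Build a Step Functions Retry block (hypothetical helper)."""
    return [
        {
            # Assumed error name for the SageMaker service integration;
            # confirm it against the error reported in your execution history.
            "ErrorEquals": ["SageMaker.AmazonSageMakerException"],
            "IntervalSeconds": interval_seconds,  # wait before the first retry
            "MaxAttempts": max_attempts,          # number of retries allowed
            "BackoffRate": backoff_rate,          # multiplier applied per retry
        }
    ]


def retry_delays(policy):
    """Return the wait (in seconds) before each retry of the first rule."""
    rule = policy[0]
    return [
        rule["IntervalSeconds"] * rule["BackoffRate"] ** attempt
        for attempt in range(rule["MaxAttempts"])
    ]


retry_policy = build_retry_policy()

# Fragment of the Task state inside the Distributed Map; the training job
# Parameters are omitted here and must come from your existing definition.
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "Retry": retry_policy,
    "End": True,
}

print(json.dumps(task_state, indent=2))
print("Waits before each retry (s):", retry_delays(retry_policy))
```

With these illustrative values the waits grow from 2 seconds up to 32 seconds, which leaves headroom if other workflows in the account are also calling CreateTrainingJob.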

It's also worth mentioning that SageMaker has its own native pipeline orchestration solution in SageMaker Pipelines, which seems (in my cases at least) to handle this without needing the retry. But while Step Functions has native integrations with a broad range of AWS services, SageMaker Pipelines is more specialized towards SageMaker (although it does offer Lambda and callback-based steps, giving enough flexibility to wedge in a range of other things if needed).

AWS
EXPERT
answered a year ago