SageMaker Pipelines API rate limit exceeded


I want to train 48 models in parallel in SageMaker Pipelines using 48 TrainingSteps. I cannot use a hyperparameter tuning job because its quota allows only 10 parallel training jobs and cannot be increased. I have configured the quota to allow up to 48 machines to be used in parallel, and the pipeline compiles and starts successfully. All of the training jobs complete successfully according to the SageMaker Training jobs dashboard.

The problem is that the pipeline itself fails. Some of the training steps register as complete, but many of them report that they have failed with the error: 'Failed to invoke sagemaker.DescribeTraining.Job. Error Details: Rate exceeded'.

The rate limit for DescribeTraining.Job is 5 requests per second and cannot be changed, so it seems that the pipeline hits this limit while polling the training jobs to update the step statuses, and that this causes the pipeline to fail.
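To make the arithmetic behind this suspicion concrete, here is a rough sketch in Python. It assumes, purely for illustration, that each running training step triggers one status call per polling interval and that all of those calls count against the same 5 requests per second limit. How the pipeline service actually schedules its polling is not known from this thread, and the function names are hypothetical.

```python
def average_calls_per_second(num_jobs: int, poll_interval_seconds: float) -> float:
    """Average status-call rate if every job is polled once per interval (assumption)."""
    if num_jobs < 0 or poll_interval_seconds <= 0:
        raise ValueError("num_jobs must be >= 0 and poll_interval_seconds must be > 0")
    return num_jobs / poll_interval_seconds


def min_poll_interval_seconds(num_jobs: int, max_calls_per_second: float) -> float:
    """Smallest per-job polling interval that keeps the average rate within the limit."""
    if num_jobs < 0 or max_calls_per_second <= 0:
        raise ValueError("num_jobs must be >= 0 and max_calls_per_second must be > 0")
    return num_jobs / max_calls_per_second


# 48 jobs polled once per second would average 48 calls/sec, far above 5/sec.
print(average_calls_per_second(48, 1.0))  # 48.0

# Under the same assumption, each job could be polled at most once every 9.6 seconds.
print(min_poll_interval_seconds(48, 5))  # 9.6
```

This only bounds the average rate. Polls that arrive in a burst within the same second can still exceed the limit even when the average looks safe.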

  • I get the same error when I run only 3 SageMaker Pipelines at the same time, with just a single TrainingStep in each pipeline. The training job succeeds, but the pipeline fails. My training job takes around 5-6 hours, so unless this is resolved I cannot rely on SageMaker Pipelines to train the models, because re-running the entire training step is very expensive in both compute and time.

Asked 2 years ago, viewed 594 times
1 Answer

Please refer to the Amazon SageMaker endpoints and quotas page: https://docs.aws.amazon.com/general/latest/gr/sagemaker.html

Per the link, the quota "Maximum number of training jobs each hyperparameter tuning job can run in parallel at once" is 10 in each supported Region, and it is not adjustable.

Because the limit is not adjustable, it cannot be increased by raising a support case.

AWS
Answered 1 year ago
