SageMaker Pipelines API rate limit exceeded


I want to train 48 models in parallel in SageMaker Pipelines using 48 TrainingSteps. I cannot use a hyperparameter tuning job because its quota allows only 10 parallel training jobs and cannot be increased. I have configured my instance quota so that up to 48 machines can be used in parallel, and the pipeline compiles and starts successfully. All of the training jobs complete successfully according to the SageMaker Training jobs dashboard.

The problem is that the pipeline itself fails. Some of the training steps register as complete, but many report that they have failed with the error: 'Failed to invoke sagemaker.DescribeTraining.Job. Error Details: Rate exceeded'.

The rate limit for DescribeTrainingJob is 5 requests per second and cannot be changed, so it seems that the pipeline hits this limit while polling the training jobs to update step status, and that this causes the pipeline to fail.
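For calls made from your own code (the pipeline's internal polling is not something this controls), throttling errors such as "Rate exceeded" are commonly handled by retrying with exponential backoff and jitter. The following is a minimal sketch of that idea, not a fix for the pipeline service itself. `ThrottledError` and `call_with_backoff` are hypothetical names introduced here for illustration; with boto3 the throttled call would be something like `describe_training_job`, and the error would arrive as a botocore exception rather than this stand-in class.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'Rate exceeded' throttling error."""


def call_with_backoff(fn, *, max_attempts=6, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottledError with exponential backoff and full jitter.

    The wait before retry k (0-based) is a random fraction of
    min(max_delay, base_delay * 2**k). The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter spreads out concurrent retries
```

The jitter matters when many callers are throttled at once (for example 48 steps being polled together): without it, all retries land at the same instant and hit the limit again.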

  • I also get the same error when I run only 3 SageMaker Pipelines at the same time, each with a single TrainingStep. The training jobs succeed, but the pipelines fail. Each training job takes around 5-6 hours, so unless this is resolved I cannot rely on SageMaker Pipelines to train the models, because re-running the entire training step is very expensive in both compute and time.
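Since the training jobs themselves succeed, one possible stopgap is to read their status and model artifact locations directly instead of re-running training. The sketch below assumes a `client` object that behaves like `boto3.client("sagemaker")`, and it deliberately paces its own `describe_training_job` calls well under the 5 requests per second limit mentioned above. The function name `collect_training_results` and the default of 2 calls per second are illustrative assumptions, and the job names would have to come from your own records or the Training jobs dashboard.

```python
import time


def collect_training_results(client, job_names, *, max_calls_per_sec=2.0,
                             sleep=time.sleep):
    """Return {job_name: (status, artifact_uri_or_None)} using paced Describe calls.

    `client` is assumed to expose describe_training_job(TrainingJobName=...)
    like boto3.client("sagemaker") does.
    """
    interval = 1.0 / max_calls_per_sec  # minimum gap between consecutive calls
    results = {}
    for i, name in enumerate(job_names):
        if i:
            sleep(interval)  # pace requests to stay below the API rate limit
        desc = client.describe_training_job(TrainingJobName=name)
        status = desc["TrainingJobStatus"]
        # ModelArtifacts may be absent while a job is still running.
        artifacts = desc.get("ModelArtifacts", {}).get("S3ModelArtifacts")
        results[name] = (status, artifacts)
    return results
```

This does not make the failed pipeline execution succeed; it only lets downstream work pick up the artifacts of jobs that finished.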

Asked 2 years ago, 594 views
1 Answer

Please refer to the link below related to the Amazon SageMaker endpoints and quotas: https://docs.aws.amazon.com/general/latest/gr/sagemaker.html

Per the link, the quota "Maximum number of training jobs each hyperparameter tuning job can run in parallel at once" has a value of 10 in each supported Region and is not adjustable.

Because the limit is not adjustable, it cannot be increased by raising a support case.

AWS
Answered a year ago
