How do I resolve SageMaker Python SDK rate exceeded and throttling exceptions?

I want to resolve throttling errors when I use the Amazon SageMaker Python SDK.

Short description

API calls to any AWS service can't exceed the maximum allowed API request rate per account and per AWS Region. These API calls might be from an application, AWS Command Line Interface (AWS CLI), or AWS Management Console. If the API requests exceed the maximum rate, then you receive the "Rate Exceeded" error, and the API calls are throttled.

You receive an error message, such as "botocore.exceptions.ClientError: An error occurred (ThrottlingException)." This error can occur when you call SageMaker APIs and the default retry configuration in Boto3 doesn't allow enough retry attempts. Override this configuration to increase the number of retry attempts and the timeouts for connecting to the service and reading a response.
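
To illustrate why a higher retry count helps, the following is a minimal sketch of retrying a throttled call with exponential backoff and jitter. It is not Boto3's implementation. ThrottledError and call_with_backoff are hypothetical names used only for this illustration, and the delay values are assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as ThrottlingException."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time between 0 and the ceiling so that
            # parallel callers don't all retry at the same moment.
            sleep(random.uniform(0, ceiling))
```

Boto3 applies its own retry logic internally, so you don't need a wrapper like this for SageMaker calls. The sketch only shows why more attempts with growing delays give a throttled request more chances to succeed.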

Resolution

To resolve this error, add a SageMaker Boto3 client with a custom retry configuration to the SageMaker Python SDK client.

  1. Create a SageMaker Boto3 client with a custom retry configuration.

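    import boto3
    from botocore.config import Config
    
    # Allow up to 20 retries and set the connect and read timeouts (in seconds).
    retry_config = Config(connect_timeout=5, read_timeout=60, retries={'max_attempts': 20})
    sm_boto = boto3.client('sagemaker', config=retry_config)
    print(sm_boto.meta.config.retries)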
  2. Use the Boto3 client from the previous step to create a SageMaker Python SDK client.

    import sagemaker
    
    sagemaker_session = sagemaker.Session(sagemaker_client = sm_boto)
    region = sagemaker_session.boto_session.region_name
    print(sagemaker_session.sagemaker_client.meta.config.retries)
  3. Test a SageMaker API with multiple requests from the SageMaker Python SDK.

    import multiprocessing
    
    def worker(TrainingJobName):
        print(sagemaker_session.sagemaker_client
              .describe_training_job(TrainingJobName=TrainingJobName)
              ['TrainingJobName'])
        return
    
    if __name__ == '__main__':
        jobs = []
        TrainingJobName = 'your-job-name'
        for i in range(10):
            p = multiprocessing.Process(target=worker, args=(TrainingJobName,))
            jobs.append(p)
            p.start()
        # Wait for all worker processes to finish.
        for p in jobs:
            p.join()
  4. Create an instance of the sagemaker.estimator.Estimator class with the sagemaker_session parameter.

    image_uri = '##########'
    s3path = '#########'
    
    estimator = sagemaker.estimator.Estimator(image_uri,
                                              sagemaker.get_execution_role(),
                                              instance_count=1,
                                              instance_type='ml.c4.4xlarge',
                                              volume_size=30,
                                              max_run=360000,
                                              input_mode='File',
                                              output_path=s3path,
                                              sagemaker_session=sagemaker_session)
  5. To confirm that the retry configuration resolves the throttling exceptions, launch a training job from the estimator that you created in the previous step.

    estimator.fit()
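
The retries dictionary that steps 1 and 2 print reports total_max_attempts instead of max_attempts. Botocore counts the initial request in that total, so a max_attempts value of 20 appears as 21. The following is a minimal sketch of a helper that reads the retry mode and retry count from that dictionary. The name summarize_retries is hypothetical and isn't part of Boto3 or the SageMaker Python SDK, and the 'legacy' fallback assumes Boto3's default retry mode.

```python
def summarize_retries(retries):
    """Return (mode, retry_count) from a client's resolved retries dictionary."""
    # Assumption: fall back to 'legacy' when the dictionary doesn't name a mode.
    mode = retries.get("mode", "legacy")
    if "total_max_attempts" in retries:
        # total_max_attempts includes the initial request, so subtract it.
        retry_count = retries["total_max_attempts"] - 1
    else:
        # Unresolved configuration: max_attempts already excludes the first call.
        retry_count = retries.get("max_attempts")
    return mode, retry_count


print(summarize_retries({"mode": "standard", "total_max_attempts": 21}))
# prints ('standard', 20)
```

To check a live session, pass sagemaker_session.sagemaker_client.meta.config.retries to the helper before you call estimator.fit().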

Related information

Boto3 documentation

AWS OFFICIAL
Updated 7 months ago
3 Comments

Somehow this does not work for me. The estimator still fails with botocore.exceptions.ClientError: An error occurred (ThrottlingException) when calling the UpdateTrialComponent operation (reached max retries: 4): Rate exceeded. This happens even though, when I inspect the estimator just before calling estimator.fit(), estimator.sagemaker_session.sagemaker_client.meta.config.retries is {'mode': 'standard', 'total_max_attempts': 21}. Thanks very much for your help!

replied 9 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 9 months ago

I should probably have mentioned that I am using sagemaker==2.197.0. Is there any update yet? Thanks very much!

replied 8 months ago