How do I troubleshoot issues with my Amazon SageMaker AI Serverless Inference endpoint?


I receive an error when I use an Amazon SageMaker AI Serverless Inference endpoint for my workload.

Resolution

Troubleshoot your issue based on the error message that you receive.

Image container size too large for Serverless Inference endpoint

Serverless Inference endpoints support bring your own container (BYOC), similar to real-time endpoints. However, because AWS Lambda backs this type of inference, the container image must be less than 10 GB.

If your container exceeds the 10 GB limit, then you receive an error message similar to the following:

"Image size 11271073144 is greater than the supported size 10737418240"

To resolve the issue, take one of the following actions:

  • Remove unused packages and minimize the number of layers in your Dockerfile to reduce the image size and optimize the Docker container.
  • Use a smaller base image.
  • Transition from a serverless endpoint to a real-time endpoint. Create a new endpoint configuration that specifies your instance type and other parameters to host the real-time endpoint. Then, update your existing endpoint with the new configuration.
  1. To create a new endpoint configuration, use the following template:
    import boto3
    
    client = boto3.client('sagemaker', region_name = 'us-east-1')
    
    endpoint_config_name = 'new-endpoint-config'
    production_variants = [
        {
        'VariantName': 'YourVariantName',
        'ModelName': 'YourModelName',
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.m5.xlarge',
            'InitialVariantWeight': 1.0
        }
    ]
    
    create_endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=production_variants
    )
    Note: Replace YourVariantName with the name of your production variant and YourModelName with the name of your model. Replace 1 with your initial instance count, ml.m5.xlarge with your instance type, and 1.0 with your initial variant weight.
  2. Create a new endpoint with the new configuration to transition from a serverless to a real-time endpoint. Or, update an existing serverless endpoint with the new configuration. After you create or update the endpoint, wait for it to reach the InService status, as shown in the sketch after these steps.
    Create a new endpoint example:
    endpoint_name = 'new-endpoint'
    
    create_endpoint_response = client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name
    )
    Update an existing endpoint example:
    endpoint_name = 'my-old-endpoint'
    
    update_endpoint_response = client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name
    )
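
Endpoint creation and updates take several minutes to complete. The following is a minimal sketch that uses the built-in boto3 endpoint_in_service waiter to block until the endpoint is ready. The endpoint name is an assumption carried over from the earlier example:

import boto3

client = boto3.client('sagemaker', region_name='us-east-1')
endpoint_name = 'new-endpoint'  # the endpoint that you created or updated

# Block until the endpoint finishes creating or updating.
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

status = client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print(f"Endpoint status: {status}")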

To determine the container size, retrieve the image URI from Amazon Elastic Container Registry (Amazon ECR) and pull the image locally. Then, run the following Docker command to view the image details, including its size:

docker images
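
You can also check the image size without pulling the image. The following minimal sketch queries Amazon ECR directly; the repository name and image tag are placeholder assumptions. Note that Amazon ECR reports the compressed image size, which can be smaller than the uncompressed size that SageMaker AI validates:

import boto3

ecr = boto3.client('ecr', region_name='us-east-1')

# 'YourRepositoryName' and 'YourImageTag' are placeholders.
response = ecr.describe_images(
    repositoryName='YourRepositoryName',
    imageIds=[{'imageTag': 'YourImageTag'}]
)

for detail in response['imageDetails']:
    print(f"Compressed image size: {detail['imageSizeInBytes']} bytes")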

Insufficient memory or disk space in the Serverless Inference endpoint

SageMaker AI serverless endpoints have a maximum memory allocation of 6 GB and an ephemeral disk storage capacity of 5 GB.

If the memory of the serverless endpoint exceeds the limit, then you receive an error message similar to the following:

"UnexpectedStatusException: Error hosting endpoint: Failed. Reason: Ping failed due to insufficient memory."

To resolve the issue, choose from the following options:

  • Adjust the MemorySizeInMB parameter.
  • Optimize your worker configuration.
  • Deploy the endpoint as a real-time endpoint and provide your required instance details.

Adjust the MemorySizeInMB parameter

  1. Create a new serverless endpoint configuration with an increased MemorySizeInMB.
    Example:
    import boto3
    
    client = boto3.client('sagemaker', region_name='us-east-1')
    
    endpoint_config_name = "new-endpoint-config"
    
    endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": YourVariantName,
                "ModelName": YourModelName,
                "ServerlessConfig": {
                    "MemorySizeInMB": NewMemorySize,
                    "MaxConcurrency": 1,
                },
            },
        ],
    )
    Note: Replace YourVariantName with the name of your variant and YourModelName with the name of your model. Set MemorySizeInMB to your required memory size. Valid values are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB (the 6 GB maximum).
  2. Update the serverless endpoint with the new configuration.
    Example:
    endpoint_name = 'my-old-endpoint'
    
    update_endpoint_response = client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name
    )
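
To confirm that the endpoint now uses the new memory size, describe the endpoint and its active configuration. The following is a minimal sketch, assuming the client and endpoint name from the previous step:

import boto3

client = boto3.client('sagemaker', region_name='us-east-1')
endpoint_name = 'my-old-endpoint'

# Look up the configuration that the endpoint currently uses.
endpoint = client.describe_endpoint(EndpointName=endpoint_name)
config = client.describe_endpoint_config(
    EndpointConfigName=endpoint['EndpointConfigName']
)

for variant in config['ProductionVariants']:
    print(variant['VariantName'], variant.get('ServerlessConfig'))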

Optimize your worker configuration

To avoid a memory error, create only one worker in the container that uses all available CPU resources. This approach differs from real-time endpoints, where some SageMaker AI containers might create a worker for each vCPU.

For prebuilt containers with the SageMaker AI inference toolkit, modify the model creation for inference by setting SAGEMAKER_MODEL_SERVER_WORKERS to 1.

Example:

import boto3

client = boto3.client('sagemaker', region_name='us-east-1')

response = client.create_model(
    ModelName='YourModelName',
    Containers=[
    {
        'Image': 'YourImage',
        'Mode': 'SingleModel',
        'ModelDataUrl': 'YourModelDataUrl',
        'Environment': {
            'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
            'SAGEMAKER_MODEL_SERVER_WORKERS': '1'
        }
    }
    ],
    ExecutionRoleArn='YourExecutionRoleArn'
)

Note: Replace YourModelName with the name of your model, YourImage with the name of your image, YourModelDataUrl with your model data URL, and YourExecutionRoleArn with the ARN of your SageMaker AI execution role.

Deploy the endpoint as a real-time endpoint and provide your required instance details

For information on how to deploy the endpoint as a real-time endpoint, see "Transition from a serverless endpoint to a real-time endpoint" in this article.

Insufficient disk space

If your serverless endpoint doesn't have available disk space, then you might receive the following error message:

"OSError: [Errno 28] No space left on device"

To resolve this issue, take the following actions:

  • Make sure that the model fits in the available disk space when it's decompressed. As a general guideline, a compressed file can expand to about three times its compressed size. To estimate the decompressed size, see the sketch after this list.
  • Use a smaller model artifact, and then make sure that all files in the .tar archive are essential in your deployment.
  • If serverless inference isn't feasible, then deploy to a real-time endpoint to customize the instance specifications.
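
To estimate the decompressed size of a model artifact before you deploy it, sum the uncompressed sizes of the archive members. The following is a minimal sketch; the model.tar.gz path is a placeholder for a local copy of your artifact:

import tarfile

DISK_LIMIT_BYTES = 5 * 1024**3  # 5 GB of ephemeral disk space

# 'model.tar.gz' is a placeholder path to your model artifact.
with tarfile.open('model.tar.gz', 'r:gz') as archive:
    total = sum(member.size for member in archive.getmembers())

print(f"Estimated decompressed size: {total} bytes")
if total > DISK_LIMIT_BYTES:
    print("The decompressed model exceeds the 5 GB disk limit.")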

Cold start scenarios in Serverless Inference endpoints

Because of the on-demand nature of resource provisioning and SageMaker AI serverless inference limitations, there isn't a definitive method to pre-warm a SageMaker AI Serverless Inference endpoint. For more information, see Minimizing cold starts.

To minimize cold start latency, update your endpoint with ProvisionedConcurrency. For the provisioned concurrency that you allocate, SageMaker AI keeps the endpoint warm so that it responds within milliseconds. For more information, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch. To check cold start activity, see the sketch after the following steps.

Complete the following steps:

  1. Create a new serverless endpoint configuration with ProvisionedConcurrency that is less than or equal to MaxConcurrency.
    Example configuration:
    import boto3
    client = boto3.client('sagemaker', region_name='us-east-1')
    
    endpoint_config_name = "new-endpoint-config"
    
    endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": YourVariantName,
                "ModelName": YourModelName,
                "ServerlessConfig": {
                    "MemorySizeInMB": NewMemorySize,
                    "MaxConcurrency": 1,
                    "ProvisionedConcurrency": 1,
                },
            },
        ],
    )
    Note: Replace YourVariantName with the name of your variant and YourModelName with the name of your model. Set MemorySizeInMB to your required memory size (the maximum is 6144 MB). Set ProvisionedConcurrency to a value that is less than or equal to MaxConcurrency.
  2. Update the endpoint with the new configuration.
    Example:
    endpoint_name = 'my-serverless-endpoint'
    update_endpoint_response = client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name
    )
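
To check how often cold starts occur, query the ModelSetupTime metric that SageMaker AI publishes for serverless endpoints when new compute resources launch. The following is a minimal sketch; the endpoint and variant names are placeholder assumptions:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# ModelSetupTime is emitted only when a cold start launches new compute.
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='ModelSetupTime',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'my-serverless-endpoint'},
        {'Name': 'VariantName', 'Value': 'YourVariantName'},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=['Average', 'Maximum'],
)

for point in response['Datapoints']:
    print(point['Timestamp'], point['Average'], point['Maximum'])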

Note: It's a best practice to use the SageMaker AI Serverless Inference Benchmarking Toolkit to determine the most efficient deployment configuration for your serverless endpoint.

Related information

Deploy models with Amazon SageMaker AI Serverless Inference
