How do I troubleshoot SageMaker AI multi-model endpoint issues?

I want to troubleshoot issues that occur when I deploy an Amazon SageMaker AI multi-model endpoint.

Resolution

Cold starts or latency issues that cause an endpoint to take longer to make predictions can affect multi-model endpoint performance. Internal server exceptions can also affect performance.

Cold starts and latency issues

When a multi-model endpoint invokes many different models in rapid succession, SageMaker AI unloads less frequently used models from the endpoint's memory and disk to free resources. A cold start happens the next time the endpoint invokes one of those unloaded models, because SageMaker AI must load the model back into the instance's memory.

To prevent cold start or latency issues, take the following actions:

  • If your model is time sensitive, then use real-time inference with a single model.

  • If your model isn't time sensitive, then select an instance type that has enough memory to hold many models at once, such as an r5 instance. Or, implement auto scaling that scales based on the MemoryUtilization metric.
    Example:

    import boto3

    # Application Auto Scaling manages scaling policies for SageMaker endpoint variants
    client = boto3.client('application-autoscaling')

    endpoint_name = 'my-multi-model-endpoint'   # replace with your endpoint name
    resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'

    # The endpoint variant must already be registered as a scalable target
    # (client.register_scalable_target) before you put a scaling policy on it.
    response = client.put_scaling_policy(
        PolicyName='MemoryUtilization-ScalingPolicy',
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': 80.0,
            'CustomizedMetricSpecification': {
                'MetricName': 'MemoryUtilization',
                'Namespace': '/aws/sagemaker/Endpoints',
                'Dimensions': [
                    {'Name': 'EndpointName', 'Value': endpoint_name},
                    {'Name': 'VariantName', 'Value': 'AllTraffic'}
                ],
                'Statistic': 'Average',
                'Unit': 'Percent'
            },
            'ScaleInCooldown': 600,
            'ScaleOutCooldown': 300
        }
    )
  • Before you deploy the endpoint configuration, make sure that you specify a VolumeSizeInGB value that's large enough to hold all the frequently accessed models.

  • Send periodic test requests to the target models so that frequently used models remain loaded in the endpoint's memory (see the invocation example after this list).

  • To optimize multi-model endpoint performance, it's a best practice to select model artifacts with comparable efficiency and latency times. To help maximize your system's overall effectiveness and reliability, evenly distribute traffic across the models.

  • For optimal endpoint performance, monitor key Amazon CloudWatch metrics, such as ModelCacheHit and ModelLoadingWaitTime. When the ModelCacheHit rate is high and the ModelLoadingWaitTime is low, your endpoint is efficiently managing invocations (see the metrics example after this list).
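
To keep specific models warm, you can invoke them directly through the SageMaker runtime API. The following minimal sketch assumes a hypothetical endpoint name (my-multi-model-endpoint), a hypothetical model artifact name (model-1.tar.gz), and a CSV payload; replace these with your own values.

    import boto3

    # Runtime client that sends inference requests to SageMaker endpoints
    runtime = boto3.client('sagemaker-runtime')

    # TargetModel is the artifact path relative to the S3 prefix that the
    # multi-model endpoint was configured with
    response = runtime.invoke_endpoint(
        EndpointName='my-multi-model-endpoint',
        TargetModel='model-1.tar.gz',
        ContentType='text/csv',
        Body='0.5,1.2,3.4'
    )

    print(response['Body'].read())

Invoking a model this way loads it into the instance's memory if it isn't already loaded, so periodic test requests keep frequently used models warm.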

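To monitor these metrics programmatically, you can query CloudWatch. The following sketch assumes a hypothetical endpoint name and the AllTraffic variant; ModelCacheHit and ModelLoadingWaitTime are published in the AWS/SageMaker namespace.

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client('cloudwatch')

    def average_over_last_day(metric_name, endpoint_name):
        """Return hourly averages for a multi-model endpoint metric over the last 24 hours."""
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/SageMaker',
            MetricName=metric_name,
            Dimensions=[
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName', 'Value': 'AllTraffic'}
            ],
            StartTime=datetime.utcnow() - timedelta(days=1),
            EndTime=datetime.utcnow(),
            Period=3600,
            Statistics=['Average']
        )
        return response['Datapoints']

    # A ModelCacheHit average close to 1 and a low ModelLoadingWaitTime
    # indicate that models rarely need to be reloaded
    print(average_over_last_day('ModelCacheHit', 'my-multi-model-endpoint'))
    print(average_over_last_day('ModelLoadingWaitTime', 'my-multi-model-endpoint'))
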
Internal server exceptions

If a multi-model endpoint can't load models because of insufficient memory, then you might experience an internal server exception. For example, when you deploy a 14 GB model on an ml.t3.medium instance with only 4 GB of CPU memory, you get an error. To prevent errors, select endpoint instances with enough memory to accommodate multiple models during runtime.
