My SageMaker Endpoint raises a server error: Worker died.


What I did

I am trying to create a SageMaker Serverless Inference endpoint using the following CloudFormation template.

AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  ModelDataUrl:
    Type: "String"
    Default: "s3://path/to/model/model.tar.gz"

Resources:
  SageMakerExecutionRole:
      Type: "AWS::IAM::Role"
      Properties:
        RoleName: "SageMakerExecutionRole"
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Principal:
                Service:
                  - "sagemaker.amazonaws.com"
              Action:
                - "sts:AssumeRole"
        Policies:
          - PolicyName: "S3GetObjectPolicy"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "s3:GetObject"
                  Resource:
                    - "arn:aws:s3:::path/to/model/*"
          - PolicyName: "ECRBatchGetItemPolicy"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "ecr:BatchCheckLayerAvailability"
                    - "ecr:BatchGetImage"
                    - "ecr:GetDownloadUrlForLayer"
                  Resource:
                    - "arn:aws:ecr:*:763104351884:repository/*"
          - PolicyName: "ECRAutholization"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "ecr:GetAuthorizationToken"
                  Resource:
                    - "*"
          - PolicyName: "Logging"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "logs:CreateLogGroup"
                    - "logs:CreateLogStream"
                    - "logs:GetLogEvents"
                    - "logs:PutLogEvents"
                  Resource:
                    - "*"
  SageMakerModel:
    Type: "AWS::SageMaker::Model"
    Properties:
      ExecutionRoleArn: !GetAtt "SageMakerExecutionRole.Arn"
      Containers:
        - Image: "763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker"
          ModelDataUrl: !Ref ModelDataUrl

  SageMakerEndpointConfig:
    Type: "AWS::SageMaker::EndpointConfig"
    Properties:
      ProductionVariants:
        - ModelName: !GetAtt "SageMakerModel.ModelName"
          VariantName: "MediaClassificationEndpoint"
          ServerlessConfig:
            MaxConcurrency: 5
            MemorySizeInMB: 3072

  SageMakerEndpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointConfigName: !GetAtt "SageMakerEndpointConfig.EndpointConfigName"
      EndpointName: "SageMakerEndpoint"

I have uploaded model.tar.gz to S3; it has the following structure:

.
├── model.pth
└── code/
    ├── inference.py
    └── requirements.txt

And here is the inference.py:

import base64
import json
from logging import getLogger
from pathlib import Path

import timm
import torch
from PIL import Image

WEIGHT_KEY = 'classifier.weight'

logger = getLogger(__name__)


def model_fn(
    model_dir: str,
) -> torch.nn.Module:
    model_path = Path(model_dir) / 'model.pth'
    assert model_path.exists(), f'{model_path} does not exist'
    state_dict = torch.load(model_path)
    num_classes = state_dict[WEIGHT_KEY].shape[0]
    model = timm.create_model(
        'tf_efficientnetv2_s.in21k',
        num_classes=num_classes,
    )
    logger.info('Model created')
    model.load_state_dict(state_dict)
    logger.info(f'Model loaded from {model_path}')
    model = model.eval()

    return model

...

Problem

I tested the endpoint using SageMaker Studio, but I encountered the following error:

Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Worker died."
}

Here are the CloudWatch logs:

2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir, context=context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/transformer.py", line 184, in validate_and_initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._model = self._run_handler_function(self._model_fn, *(model_dir,))
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/transformer.py", line 272, in _run_handler_function
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - result = func(*argv)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,771 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-05-13T07:01:21.773Z	2024-05-13T07:01:21,772 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.

Could you tell me the cause of the error and how to solve the problem? Thank you in advance.

Itto
Asked 1 month ago · Viewed 208 times
1 Answer

I can't see any obvious error in your setup, but there are a couple of things I'd suggest to help debug:

First, I'd set the PYTHONUNBUFFERED environment variable to '1' (or similar), which you can do in your AWS::SageMaker::Model's ContainerDefinition. Forcing Python to use unbuffered I/O increases the likelihood that error details make it to CloudWatch before the thread that generated them gets killed. I'd also temporarily replace (or augment) your logger.info calls with prints, in case the configured log level is hiding them. See the sketch below for the environment variable change.
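As a rough sketch, the SageMakerModel resource from your template could pass the variable like this (only the Environment block is new; everything else is copied from the template above):

  SageMakerModel:
    Type: "AWS::SageMaker::Model"
    Properties:
      ExecutionRoleArn: !GetAtt "SageMakerExecutionRole.Arn"
      Containers:
        - Image: "763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker"
          ModelDataUrl: !Ref ModelDataUrl
          # New: force unbuffered stdout/stderr so tracebacks reach CloudWatch before the worker dies
          Environment:
            PYTHONUNBUFFERED: "1"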

Next, I'd test your setup using the SageMaker Python SDK's "Local Mode". This runs your inference container locally via Docker Compose, so (after the first image pull) it gives a much faster debugging cycle than creating a remote Endpoint each time. You'll need an environment with Python and the sagemaker SDK installed. If you're using SageMaker Studio, you'll need to enable and install Docker; if you're not already in Studio and don't have Python/Docker locally, it might be fastest to create a SageMaker Notebook Instance to try it out. When you use the SageMaker Python SDK for testing, watch out that it doesn't start accumulating Models or EndpointConfigs on the API side (check the AWS Console, and/or explicitly specify a new name each time, to help avoid this). A minimal sketch is shown after this paragraph.
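Something like the following could work as a Local Mode test. This is only a sketch: the role ARN is a placeholder, the framework/Python versions are assumed to match your pytorch-inference:2.2.0-cpu-py310 image, and inference.py is the same handler script you package under code/ (it must also exist locally next to this test script).

# Minimal Local Mode sketch -- assumes Docker is running and AWS credentials are configured.
from sagemaker.local import LocalSession
from sagemaker.pytorch import PyTorchModel

session = LocalSession()
session.config = {"local": {"local_code": True}}

model = PyTorchModel(
    model_data="s3://path/to/model/model.tar.gz",  # same archive as the endpoint uses
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    framework_version="2.2",   # assumed to match the DLC image tag
    py_version="py310",
    entry_point="inference.py",  # same handler script as in code/
    sagemaker_session=session,
)

# instance_type="local" makes the SDK run the inference container with Docker Compose on this machine.
predictor = model.deploy(initial_instance_count=1, instance_type="local")

# Invoke it with a test payload, then tear the local container down.
# result = predictor.predict(...)
predictor.delete_endpoint()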

It looks like you've already got this sorted, but maybe double-check that your model.tar.gz extracts model.pth (and code/) into the current folder, rather than creating a new model/ subdirectory containing everything. I'd also put some kind of print right at the top of your inference.py to validate that it's actually being imported (which it should be, if it's at ./code/inference.py in your tarball). You can check the archive layout with something like the snippet below.
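For instance, a quick local check of the archive contents (run against the same model.tar.gz you uploaded):

# List the archive contents -- expect top-level entries like 'model.pth' and
# 'code/inference.py', not a wrapping 'model/' directory.
import tarfile

with tarfile.open("model.tar.gz", "r:gz") as tar:
    for name in tar.getnames():
        print(name)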

AWS
EXPERT
Alex_T
Answered 15 days ago
