My SageMaker Endpoint raises a server error: Worker died.


What I did

I am trying to create a SageMaker Serverless Inference endpoint using the following CloudFormation template.

AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  ModelDataUrl:
    Type: "String"
    Default: "s3://path/to/model/model.tar.gz"

Resources:
  SageMakerExecutionRole:
      Type: "AWS::IAM::Role"
      Properties:
        RoleName: "SageMakerExecutionRole"
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Principal:
                Service:
                  - "sagemaker.amazonaws.com"
              Action:
                - "sts:AssumeRole"
        Policies:
          - PolicyName: "S3GetObjectPolicy"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "s3:GetObject"
                  Resource:
                    - "arn:aws:s3:::path/to/model/*"
          - PolicyName: "ECRBatchGetItemPolicy"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "ecr:BatchCheckLayerAvailability"
                    - "ecr:BatchGetImage"
                    - "ecr:GetDownloadUrlForLayer"
                  Resource:
                    - "arn:aws:ecr:*:763104351884:repository/*"
          - PolicyName: "ECRAutholization"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "ecr:GetAuthorizationToken"
                  Resource:
                    - "*"
          - PolicyName: "Logging"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: "Allow"
                  Action:
                    - "logs:CreateLogGroup"
                    - "logs:CreateLogStream"
                    - "logs:GetLogEvents"
                    - "logs:PutLogEvents"
                  Resource:
                    - "*"
  SageMakerModel:
    Type: "AWS::SageMaker::Model"
    Properties:
      ExecutionRoleArn: !GetAtt "SageMakerExecutionRole.Arn"
      Containers:
        - Image: "763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker"
          ModelDataUrl: !Ref ModelDataUrl

  SageMakerEndpointConfig:
    Type: "AWS::SageMaker::EndpointConfig"
    Properties:
      ProductionVariants:
        - ModelName: !GetAtt "SageMakerModel.ModelName"
          VariantName: "MediaClassificationEndpoint"
          ServerlessConfig:
            MaxConcurrency: 5
            MemorySizeInMB: 3072

  SageMakerEndpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointConfigName: !GetAtt "SageMakerEndpointConfig.EndpointConfigName"
      EndpointName: "SageMakerEndpoint"

I have uploaded model.tar.gz to S3; it has the following structure:

.
├── model.pth
└── code/
    ├── inference.py
    └── requirements.txt

And here is the inference.py:

import base64
import json
from logging import getLogger
from pathlib import Path

import timm
import torch
from PIL import Image

WEIGHT_KEY = 'classifier.weight'

logger = getLogger(__name__)


def model_fn(
    model_dir: str,
) -> torch.nn.Module:
    model_path = Path(model_dir) / 'model.pth'
    assert model_path.exists(), f'{model_path} does not exist'
    state_dict = torch.load(model_path)
    num_classes = state_dict[WEIGHT_KEY].shape[0]
    model = timm.create_model(
        'tf_efficientnetv2_s.in21k',
        num_classes=num_classes,
    )
    logger.info('Model created')
    model.load_state_dict(state_dict)
    logger.info(f'Model loaded from {model_path}')
    model = model.eval()

    return model

...

Problem

I tested the endpoint using SageMaker Studio, but I encountered the following error:

Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Worker died."
}

Here are the CloudWatch logs:

2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,768 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir, context=context)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,769 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/transformer.py", line 184, in validate_and_initialize
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._model = self._run_handler_function(self._model_fn, *(model_dir,))
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/transformer.py", line 272, in _run_handler_function
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - result = func(*argv)
2024-05-13T07:01:21.771Z	2024-05-13T07:01:21,771 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-05-13T07:01:21.773Z	2024-05-13T07:01:21,772 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.

Could you tell me the cause of the error and how to solve the problem? Thank you in advance.

Itto
Asked 1 month ago · Viewed 208 times
1 Answer

I can't see any obvious error in your setup, but there are a couple of things I'd suggest to help debug:

First, I'd set the PYTHONUNBUFFERED environment variable to '1' (or similar), which you can do in your AWS::SageMaker::Model's ContainerDefinition. Forcing Python to use unbuffered I/O increases the likelihood that error details make it to CloudWatch before the thread that generated them gets killed. I'd also temporarily replace (or augment) your logger.info calls with prints, in case the configured log level is hiding them. See the sketch below for the environment variable change.
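As a rough sketch, the SageMakerModel resource from your template could pass the variable like this (only the Environment block is new; everything else is copied from the template above):

  SageMakerModel:
    Type: "AWS::SageMaker::Model"
    Properties:
      ExecutionRoleArn: !GetAtt "SageMakerExecutionRole.Arn"
      Containers:
        - Image: "763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker"
          ModelDataUrl: !Ref ModelDataUrl
          # New: force unbuffered stdout/stderr so tracebacks reach CloudWatch before the worker dies
          Environment:
            PYTHONUNBUFFERED: "1"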

Next, I'd test your setup using the SageMaker Python SDK's "Local Mode". This runs your inference container locally via Docker Compose, so (after the first image pull) it gives a much faster debugging cycle than creating a remote Endpoint each time. You'll need an environment with Python and the sagemaker SDK installed. If you're using SageMaker Studio, you'll need to enable and install Docker; if you're not already in Studio and don't have Python/Docker locally, it might be fastest to create a SageMaker Notebook Instance to try it out. When you use the SageMaker Python SDK for testing, watch out that it doesn't start accumulating Models or EndpointConfigs on the API side (check the AWS Console, and/or explicitly specify a new name each time, to help avoid this). A minimal sketch is shown after this paragraph.
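Something like the following could work as a Local Mode test. This is only a sketch: the role ARN is a placeholder, the framework/Python versions are assumed to match your pytorch-inference:2.2.0-cpu-py310 image, and inference.py is the same handler script you package under code/ (it must also exist locally next to this test script).

# Minimal Local Mode sketch -- assumes Docker is running and AWS credentials are configured.
from sagemaker.local import LocalSession
from sagemaker.pytorch import PyTorchModel

session = LocalSession()
session.config = {"local": {"local_code": True}}

model = PyTorchModel(
    model_data="s3://path/to/model/model.tar.gz",  # same archive as the endpoint uses
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    framework_version="2.2",   # assumed to match the DLC image tag
    py_version="py310",
    entry_point="inference.py",  # same handler script as in code/
    sagemaker_session=session,
)

# instance_type="local" makes the SDK run the inference container with Docker Compose on this machine.
predictor = model.deploy(initial_instance_count=1, instance_type="local")

# Invoke it with a test payload, then tear the local container down.
# result = predictor.predict(...)
predictor.delete_endpoint()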

It looks like you've already got this sorted, but maybe double-check that your model.tar.gz extracts model.pth (and code/) into the current folder, rather than creating a new model/ subdirectory containing everything. I'd also put some kind of print right at the top of your inference.py to validate that it's actually being imported (which it should be, if it's at ./code/inference.py in your tarball). You can check the archive layout with something like the snippet below.
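For instance, a quick local check of the archive contents (run against the same model.tar.gz you uploaded):

# List the archive contents -- expect top-level entries like 'model.pth' and
# 'code/inference.py', not a wrapping 'model/' directory.
import tarfile

with tarfile.open("model.tar.gz", "r:gz") as tar:
    for name in tar.getnames():
        print(name)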

AWS
EXPERT
Alex_T
Answered 15 days ago
