How to debug an invocation timeout in SageMaker?


I am testing inference in SageMaker using one of the containers listed here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md. The model is packaged as shown below, and in inference.py I override functions such as model_fn and predict_fn. I tested this with batch transform. It worked for a few small input files, but for larger files I keep getting "Model server did not respond to /invocations request within 3600 seconds". How can I find the cause? 3600 is the maximum we can set for the "invocation timeout in seconds" parameter, and the default input size for batch transform is 6 MB. The input files I'm using are much smaller than that, but I still get the error.

Directory structure

model.tar.gz/
|- model.pth
|- code/
  |- inference.py
  |- requirements.txt  

File: inference.py

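import os

import torch

def model_fn(model_dir):
    # Called once at start-up; model_dir is where model.tar.gz was extracted.
    model = Your_Model()  # placeholder for the actual model class
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model

def predict_fn(input_data, model):
    # Prediction logic omitted in the original post.
    ...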

Based on the docs here, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containers-should-respond-to-inferences, do we need to install Flask and expose an /invocations endpoint that responds with 200 OK when we are using a custom container?

Asked 2 years ago · Viewed 2550 times
1 Answer

One of the best ways to debug a custom inference script is to start with SageMaker "local mode". Once you are sure that your script works, move on to hosting it on a SageMaker endpoint. Here are some examples to get started.

For example, for a TensorFlow Serving model with a custom inference script, I would use local mode for testing as shown below:

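from sagemaker.tensorflow.model import TensorFlowModel
from sagemaker.local import LocalSession

# model_data (model artifact location) and sagemaker_role (IAM role) are assumed to be defined earlier.
tensorflow_serving_model = TensorFlowModel(
    model_data=model_data,
    role=sagemaker_role,
    framework_version="2.6",
    # sagemaker_session=sagemaker_session,  # regular (remote) session
    sagemaker_session=LocalSession(),  # run the container locally via Docker
)

# Assumed next step (not in the original answer): deploy to the local machine.
predictor = tensorflow_serving_model.deploy(initial_instance_count=1, instance_type="local")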
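
Before involving the container at all, it can also help to call the handlers from inference.py directly and time them, since a predict_fn that becomes very slow on larger inputs could be one cause of the timeout. The following is only a minimal sketch, assuming handlers with the signatures model_fn(model_dir) and predict_fn(input_data, model). The names time_handlers, fake_model_fn and fake_predict_fn are hypothetical and used for illustration; they are not part of the SageMaker SDK.

```python
import time


def time_handlers(model_fn, predict_fn, model_dir, payloads):
    """Call the inference handlers directly and record how long each call takes."""
    timings = {}

    # Time model loading, which happens once per container start.
    start = time.perf_counter()
    model = model_fn(model_dir)
    timings["model_fn"] = time.perf_counter() - start

    # Time each prediction separately so slow inputs stand out.
    timings["predict_fn"] = []
    for payload in payloads:
        start = time.perf_counter()
        predict_fn(payload, model)
        timings["predict_fn"].append(time.perf_counter() - start)
    return timings


# Stand-in handlers so the sketch runs on its own. In practice you would
# import the real ones, e.g. `from inference import model_fn, predict_fn`.
def fake_model_fn(model_dir):
    return {"model_dir": model_dir}


def fake_predict_fn(input_data, model):
    time.sleep(0.001 * len(input_data))  # pretend cost grows with input size
    return len(input_data)


demo_timings = time_handlers(fake_model_fn, fake_predict_fn, "model", ["ab", "abcdef"])
print(demo_timings)
```

If the per-input timings grow sharply with input size, the problem is more likely in the prediction code than in the container or the batch transform settings.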
AWS
Answered 2 years ago
