How to debug an invocation timeout in SageMaker?


I am testing inference in SageMaker using one of the containers listed here -> https://github.com/aws/deep-learning-containers/blob/master/available_images.md. The model is packaged as shown below, and within the inference.py file I am overriding functions such as model_fn and predict_fn (a sketch of how I create the transform job follows the code below). I tested this with batch transform and it worked for a few small input files, but for other, larger files I keep getting "Model server did not respond to /invocations request within 3600 seconds". I'm trying to find out what causes it. 3600 is the maximum we can set for the "invocation timeout in seconds" parameter, and the default payload size for batch is 6 MB; the input files I'm using are far smaller than that, but I still get the error.

Directory structure

model.tar.gz/
|- model.pth
|- code/
  |- inference.py
  |- requirements.txt  

File: inference.py

import torch
import os

def model_fn(model_dir):
    # Load the weights that were packaged in model.tar.gz
    model = Your_Model()
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model

def predict_fn(input_object, model):
    # Run inference on the deserialized input
    with torch.no_grad():
        return model(input_object)
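
For context, this is roughly how I create the transform job. The bucket paths, role, instance type, and framework versions below are placeholders rather than my exact values, but the timeout and payload size are the parameters I mentioned above:

from sagemaker.pytorch import PyTorchModel

sagemaker_role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder role

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",   # placeholder path to the archive above
    role=sagemaker_role,
    entry_point="inference.py",
    framework_version="1.12",                   # placeholder framework/Python versions
    py_version="py38",
)

transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_payload=6,                              # MB per request (the default mentioned above)
)

transformer.transform(
    data="s3://my-bucket/batch-input/",         # placeholder input prefix
    content_type="application/json",
    split_type="Line",
    model_client_config={
        "InvocationsTimeoutInSeconds": 3600,    # the "invocation timeout in seconds" parameter
        "InvocationsMaxRetries": 3,
    },
)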

Based on the docs here, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containers-should-respond-to-inferences, do we need to install Flask and expose an /invocations endpoint that responds with 200 OK when we are using a custom container?
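
As I read those docs, the contract they describe looks roughly like the minimal Flask sketch below; run_inference and the payload handling are just placeholders to illustrate the shape, not my actual code:

from flask import Flask, Response, request

app = Flask(__name__)

def run_inference(payload):
    # Placeholder for the actual model call
    return '{"predictions": []}'

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: respond 200 once the model is loaded and ready
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    # Each batch transform request arrives here as a POST with the record payload
    result = run_inference(request.data)
    return Response(result, status=200, mimetype="application/json")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # SageMaker containers serve on port 8080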

1 Answer

One of the best ways to debug a custom inference script is to start with SageMaker "local mode". Once you are sure that your script works, move over to hosting on a SageMaker endpoint. Here is an example to get you started.

For example, for a TF Serving model with a custom inference script, I would use local mode as shown below for my testing:

from sagemaker.tensorflow.model import TensorFlowModel
from sagemaker.local import LocalSession

tensorflow_serving_model = TensorFlowModel(
    model_data=model_data,
    role=sagemaker_role,
    framework_version="2.6",
    # sagemaker_session=sagemaker_session,  # regular session for a real endpoint
    sagemaker_session=LocalSession(),       # local mode: run the container with Docker
)
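
Once the model is defined against a LocalSession, you can deploy it to a local Docker container and send a test request, which surfaces errors in inference.py much faster than waiting on a full batch transform job. A rough sketch (the sample payload is just a placeholder for your input format):

predictor = tensorflow_serving_model.deploy(
    initial_instance_count=1,
    instance_type="local",   # "local" runs the serving container on this machine via Docker
)

result = predictor.predict({"instances": [[1.0, 2.0, 3.0]]})  # placeholder payload
print(result)

predictor.delete_endpoint()  # stop the local container when done

Local mode also supports batch transform, so you can exercise the same transformer() code path against a small local input before scheduling the full job.
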
AWS
answered 2 years ago
