How to debug an invocation timeout in SageMaker?


I am testing inference in SageMaker using one of the containers listed here -> https://github.com/aws/deep-learning-containers/blob/master/available_images.md. The model is packaged as shown below, and within the inference.py file I am overriding functions such as model_fn and predict_fn (a sketch of how I create the transform job follows the code below). I tested this with batch transform and it worked for a few small input files, but for other, larger files I keep getting "Model server did not respond to /invocations request within 3600 seconds". I'm trying to find out what causes it. 3600 is the maximum we can set for the "invocation timeout in seconds" parameter, and the default payload size for batch is 6 MB; the input files I'm using are far smaller than that, but I still get the error.

Directory structure

model.tar.gz/
|- model.pth
|- code/
  |- inference.py
  |- requirements.txt  

File: inference.py

import torch
import os

def model_fn(model_dir):
    # Load the weights that were packaged in model.tar.gz
    model = Your_Model()
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model

def predict_fn(input_object, model):
    # Run inference on the deserialized input
    with torch.no_grad():
        return model(input_object)
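
For context, this is roughly how I create the transform job. The bucket paths, role, instance type, and framework versions below are placeholders rather than my exact values, but the timeout and payload size are the parameters I mentioned above:

from sagemaker.pytorch import PyTorchModel

sagemaker_role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder role

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",   # placeholder path to the archive above
    role=sagemaker_role,
    entry_point="inference.py",
    framework_version="1.12",                   # placeholder framework/Python versions
    py_version="py38",
)

transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_payload=6,                              # MB per request (the default mentioned above)
)

transformer.transform(
    data="s3://my-bucket/batch-input/",         # placeholder input prefix
    content_type="application/json",
    split_type="Line",
    model_client_config={
        "InvocationsTimeoutInSeconds": 3600,    # the "invocation timeout in seconds" parameter
        "InvocationsMaxRetries": 3,
    },
)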

Based on the docs here, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containers-should-respond-to-inferences, do we need to install Flask and expose an /invocations endpoint that responds with 200 OK when we are using a custom container?
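
As I read those docs, the contract they describe looks roughly like the minimal Flask sketch below; run_inference and the payload handling are just placeholders to illustrate the shape, not my actual code:

from flask import Flask, Response, request

app = Flask(__name__)

def run_inference(payload):
    # Placeholder for the actual model call
    return '{"predictions": []}'

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: respond 200 once the model is loaded and ready
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    # Each batch transform request arrives here as a POST with the record payload
    result = run_inference(request.data)
    return Response(result, status=200, mimetype="application/json")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # SageMaker containers serve on port 8080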

1 Answer

One of the best ways to debug a custom inference script is to start with SageMaker "local mode". Once you are sure that your script works, move over to hosting on a SageMaker endpoint. Here is an example to get you started.

For example, for a TF Serving model with a custom inference script, I would use local mode as shown below for my testing:

from sagemaker.tensorflow.model import TensorFlowModel
from sagemaker.local import LocalSession

tensorflow_serving_model = TensorFlowModel(
    model_data=model_data,
    role=sagemaker_role,
    framework_version="2.6",
    # sagemaker_session=sagemaker_session,  # regular session for a real endpoint
    sagemaker_session=LocalSession(),       # local mode: run the container with Docker
)
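
Once the model is defined against a LocalSession, you can deploy it to a local Docker container and send a test request, which surfaces errors in inference.py much faster than waiting on a full batch transform job. A rough sketch (the sample payload is just a placeholder for your input format):

predictor = tensorflow_serving_model.deploy(
    initial_instance_count=1,
    instance_type="local",   # "local" runs the serving container on this machine via Docker
)

result = predictor.predict({"instances": [[1.0, 2.0, 3.0]]})  # placeholder payload
print(result)

predictor.delete_endpoint()  # stop the local container when done

Local mode also supports batch transform, so you can exercise the same transformer() code path against a small local input before scheduling the full job.
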
AWS
answered 2 years ago
