Model error from a deployed SageMaker endpoint

After fine-tuning a DistilBERT model, saving it as 'model.pth', and writing an 'inference.py' script, I packaged both into a '.tar.gz' file. Deploying it created an endpoint successfully, but I encountered an error when attempting to invoke the endpoint.

The error is: ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."

The inference.py code is below:

import json
import torch
import logging
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import os

# Set up logging
logging.basicConfig(level=logging.INFO)

# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Define model_fn function to load the model
def model_fn(model_dir):
    # Load the model architecture
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)  # Assuming you have 3 labels
    
    # Load the model state dictionary
    model_state_path = os.path.join(model_dir, 'model.pth')
    model.load_state_dict(torch.load(model_state_path, map_location=torch.device('cpu')))  # Load the model on CPU
    
    # Set the model in evaluation mode
    model.eval()
    
    return model

# Define the predict function
def predict(review_text, model):
    encoding = tokenizer.encode_plus(
        review_text,
        add_special_tokens=True,
        max_length=512,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt',
        truncation=True
    )

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    logits = outputs.logits  # Get the logits from the output
    prediction = torch.argmax(logits, dim=1).item()
    label_dict = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
    sentiment = label_dict[prediction]

    return sentiment

# Define input and output functions
def input_fn(input_data, content_type):
    logging.info("Input function invoked")
    if content_type == 'application/json':
        data = json.loads(input_data)
        return data['review_text']
    else:
        raise ValueError(f'Unsupported content type: {content_type}')

def output_fn(prediction_output, accept):
    logging.info("Output function invoked")
    return str(prediction_output)

def predict_fn(input_data, model):
    logging.info("Predict function invoked")
    return predict(input_data, model)

1 Answer

It looks like you're using either the HuggingFace or PyTorch framework containers to deploy your model.

Tarball structure

In either case (as documented here for PyTorch, here for HF), your inference.py code should be located in a code/ subfolder of your model tarball, but from the question it sounds like you might have it in the root?

If you're preparing your tarball by hand, check also that it correctly extracts to the current directory ., and doesn't e.g. create a new subfolder model/model.pth when you extract it... For example, I've sometimes created them with a command like tar -czf ../model.tar.gz . from inside the folder where I've prepared my artifacts.
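
For reference, here's a minimal packaging sketch in Python (the model_artifacts folder name is just an assumption, but the layout inside the tarball is what the framework containers expect):

import os
import tarfile

# Assumed local layout before packaging:
#   model_artifacts/
#       model.pth              (or the save_pretrained() output files)
#       code/
#           inference.py
#           requirements.txt   (optional, for extra pip dependencies)
artifact_dir = "model_artifacts"

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in os.listdir(artifact_dir):
        # arcname keeps each entry at the tarball root, so it extracts to '.'
        # instead of creating a nested model_artifacts/ folder
        tar.add(os.path.join(artifact_dir, name), arcname=name)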

Model format

Since you're already providing a custom model_fn, you don't need to go to the effort of converting to a model.pth if you don't want to... For HuggingFace models I find it's easier to just use, for example, Trainer.save_model() to write to the target folder, and then at inference time you can directly:

model = DistilBertForSequenceClassification.from_pretrained(model_dir)

As shown in the linked example, I'd probably save the tokenizer in your tarball too, to avoid any hidden external dependencies.
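
To sketch what model_fn could look like in that case (assuming you saved both the model and the tokenizer into the tarball with save_pretrained()/Trainer.save_model(); your predict_fn would then need to unpack the returned pair):

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

def model_fn(model_dir):
    # Everything loads from the extracted tarball contents -- no hub download
    # at container start-up and no state-dict conversion needed
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_dir)
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer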

Debugging

Your endpoint's CloudWatch logs are usually the best place to look for what's going wrong with deployments like this, but I know the default configuration can be a bit sparse...

I'd suggest setting env={"PYTHONUNBUFFERED": "1"} when you create your HuggingFaceModel to disable Python log buffering and ensure that logs from a crashing thread/process actually get written to CloudWatch before it dies. If you're currently going straight from the shortcut estimator.deploy(), you'll need to change your code to create a Model first to be able to specify this parameter.
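
Something along these lines, for example (the S3 path, IAM role, and framework versions below are placeholders, not taken from your setup):

from sagemaker.huggingface import HuggingFaceModel

hf_model = HuggingFaceModel(
    model_data="s3://my-bucket/path/to/model.tar.gz",        # placeholder
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"PYTHONUNBUFFERED": "1"},  # flush logs promptly to CloudWatch
)

predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)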

Deploying a SageMaker endpoint involves three API-side steps that the SDK makes a little non-obvious: creating a "Model", an "Endpoint Configuration", and an "Endpoint". To make matters more confusing, instantiating an SDK object such as HuggingFaceModel doesn't actually create a SageMaker Model yet, because the SDK doesn't have all the information it needs at that point (the container URI is inferred from the instance type, which the SDK only collects when you create a Transformer or Predictor). Be careful when re-trying different configurations to check (e.g. in the AWS Console for SageMaker) that your previous model and endpoint configuration are actually getting deleted and not re-used, or you might feel like you're trying different things and seeing the same result.
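
One quick way to double-check what actually exists on the API side before re-deploying (the NameContains filter here is just an example):

import boto3

sm = boto3.client("sagemaker")

print(sm.list_models(NameContains="distilbert")["Models"])
print(sm.list_endpoint_configs(NameContains="distilbert")["EndpointConfigs"])
print(sm.list_endpoints(NameContains="distilbert")["Endpoints"])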

Finally, you'll want to avoid waiting several minutes for an endpoint to deploy every time you check whether a new configuration works. I'd recommend verifying your inference.py functions behave as you expect against an extracted local folder containing the contents of your model.tar.gz... To test the whole stack in an environment where Docker is available, you can use instance_type='local' - SageMaker Local Mode. If you're working on notebooks in SageMaker Studio, note that Local Mode wasn't previously supported, but now is! You just have to enable it and install Docker on the Studio instance.
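
A rough sketch of that kind of local check, assuming you've extracted your model.tar.gz into a folder called extracted_model and can import your inference.py directly:

import json
import inference  # the same inference.py you package under code/

model = inference.model_fn("extracted_model")  # extracted tarball contents
payload = json.dumps({"review_text": "Great product, arrived quickly!"})

data = inference.input_fn(payload, "application/json")
prediction = inference.predict_fn(data, model)
print(inference.output_fn(prediction, "application/json"))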

Answered by Alex_T (AWS Expert), 7 days ago
