[Problem with MMS predict] SageMaker MMS returns error code 500, type InternalServerException


I built a PyTorch model on SageMaker with a multi-model endpoint (MMS). This is my MMS code.

%%time
instance_type = 'c5.large'
# accelerator_type = 'eia2.medium'
predictor = mme.deploy(
    initial_instance_count=1,
    instance_type=f"ml.{instance_type}"
)

mme.add_model(model_data_source=model_path, model_data_path="model.tar.gz")
list(mme.list_models())
#> [ 'model.tar.gz']
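
(For context: mme and model_path are defined earlier in the notebook and are not shown. A minimal sketch of how that setup might look, where the bucket, prefix, entry point, and framework version are assumptions, not taken from the original post:)

import sagemaker
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

# Assumed names: S3 locations below are placeholders.
model_data_prefix = 's3://my-bucket/mme-artifacts/'          # prefix the endpoint loads models from
model_path = 's3://my-bucket/training-output/model.tar.gz'   # artifact produced by training

pytorch_model = PyTorchModel(
    model_data=model_path,
    role=role,
    entry_point='inference.py',      # custom inference handler
    framework_version='1.11.0',
    py_version='py38',
)

mme = MultiDataModel(
    name='LV-multi',
    model_data_prefix=model_data_prefix,
    model=pytorch_model,             # serving container is taken from this model
)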

I try to predict with this code.
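
(For context, requests is assumed here to be the raw bytes of the input image, since the endpoint is later invoked with ContentType='application/x-image'; a minimal sketch of how it might be built, with a placeholder file name:)

# Assumption: the payload is raw image bytes read from a local file.
with open('sample.jpg', 'rb') as f:
    requests = f.read()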

start_time = time.time()
predicted_value = predictor.predict(requests, target_model="LV1")
duration = time.time() - start_time
print("${:,.2f}, took {:,d} ms\n".format(predicted_value[0], int(duration * 1000)))

And it returns this error message:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Failed to start workers"
}

MMS with PyTorch is a 'little' difficult. X)

Help me, please.

Asked 2 years ago, 417 views
2 Answers
Accepted Answer

Hi, I think the target model in your prediction needs to be the name of the model you have deployed. For example, when you add the model with mme.add_model(model_data_source=model_path, model_data_path="model.tar.gz"), the model_data_path contains the name of the model. From the sagemaker-examples notebook (https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.ipynb): "model_data_path is the relative path to the S3 prefix we specified above (i.e. model_data_prefix) where our endpoint will source models for inference requests. Since this is a relative path, we can simply pass the name of what we wish to call the model artifact at inference time (i.e. Chicago_IL.tar.gz)." In your case that name is "model.tar.gz". However, when predicting you call the model with target_model="LV1"?
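
In other words, target_model must match the artifact name you passed as model_data_path. A minimal sketch of the corrected call, assuming the same requests payload as in the question:

# target_model must be the artifact name given to add_model(model_data_path=...),
# i.e. "model.tar.gz", not an arbitrary label such as "LV1".
predicted_value = predictor.predict(data=requests, target_model='model.tar.gz')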

AWS
EXPERT
answered 2 years ago
  • According to your comment, I modified the code and executed it; see my answer below.


According to your comment, I modified the code and ran it again. I tried two solutions.

#1 predictor.predict

predicted_value = predictor.predict(data=requests, target_model="modal.tar.gz")

It returns:

ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Failed to download model data(bucket: sagemaker-ap-northeast-2-344487737937, key: LouisVuiotton-cpu-2022-08-16-02-02-04-408-c6i-large/model/modal.tar.gz). Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the model.

#2 With boto3, invoke_endpoint()

import boto3

client = boto3.client('sagemaker-runtime')
endpoint_name = predictor.endpoint_name
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=requests,
    ContentType='application/x-image',
#     Accept='string',
#     CustomAttributes='string',
    TargetModel='model.tar.gz',
#     TargetVariant='string',
#     TargetContainerHostname='string',
#     InferenceId='string'
)

It returns:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Failed to start workers"
}
". See https://ap-northeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-northeast-2#logEventViewer:group=/aws/sagemaker/Endpoints/LV-multi-2022-08-16-02-11-15 in account 344487737937 for more information.

I assume that the result of solution 2 (boto3 invoke_endpoint), "Failed to start workers", comes from the same cause as the error in solution 1: the role passed to CreateModel does not have permission to download the model.

I already use the execution role ['arn:aws:iam::344487737937:role/service-role/AmazonSageMaker-ExecutionRole-20220713T151818']. How do I get the additional permissions (so that the role passed to CreateModel can download the model)?

answered 2 years ago
  • These are the IAM policies attached to my role at my company:

     IAMReadOnlyAccess
     CloudWatchLogsReadOnlyAccess
     AmazonSageMakerFullAccess
     AmazonS3FullAccess
     ServiceQuotasFullAccess
     AWSBillingReadOnlyAccess
    

    Should I attach more IAM policies?

  • I solved this! I was trying to download the image at the endpoint, but the endpoint cannot connect to the outside network, so I go through Lambda instead (a sketch of this flow follows the list):

    1. Make the request with an S3 URL
    2. Download the image from S3 in Lambda
    3. Send the image from Lambda to the endpoint
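
A minimal sketch of that Lambda flow, assuming the event carries the S3 bucket and key of the image; the endpoint name is taken from the CloudWatch link above and the handler/event shape is an assumption:

import boto3

s3 = boto3.client('s3')
runtime = boto3.client('sagemaker-runtime')

ENDPOINT_NAME = 'LV-multi-2022-08-16-02-11-15'   # assumed; use predictor.endpoint_name

def lambda_handler(event, context):
    # 1. The caller sends the S3 location of the image instead of the raw bytes.
    bucket = event['bucket']
    key = event['key']

    # 2. Download the image from S3 inside Lambda.
    image_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    # 3. Forward the image bytes to the multi-model endpoint.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/x-image',
        Body=image_bytes,
        TargetModel='model.tar.gz',
    )
    return response['Body'].read().decode('utf-8')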
