AWS SageMaker Endpoint Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check

Links to the AWS notebooks for reference: https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/xgboost_bring_your_own_model/xgboost_bring_your_own_model.ipynb

https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

I am using the examples from these notebooks to create and deploy an endpoint to the AWS SageMaker cloud. All checks pass locally, but when I attempt to deploy the endpoint to the cloud I get the error in the title.

Code

In my local notebook (on my personal machine, NOT a SageMaker notebook):

    import pandas
    import xgboost
    from xgboost import XGBRegressor
    import numpy as np
    from sklearn.model_selection import train_test_split, RandomizedSearchCV
    
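    print(xgboost.__version__)
    # Output: 1.0.1

    # Fit model (r is the search object defined earlier, not shown here)
    r.fit(X_train.toarray(), y_train.values)

    # The fitted search object exposes the best model as best_estimator_ (trailing underscore)
    xgbest = r.best_estimator_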

AWS SageMaker Endpoint code

import boto3
import pickle
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from time import gmtime, strftime

region = boto3.Session().region_name

role = 'arn:aws:iam::111:role/xxx-sagemaker-role'

bucket = 'ml-model'
prefix = "sagemaker/xxx-xgboost-byo"
bucket_path = "https://s3-{}.amazonaws.com/{}".format('us-west-1', 'ml-model')

client = boto3.client(
    's3',
    aws_access_key_id='xxx',
    aws_secret_access_key='xxx'
)
client.list_objects(Bucket=bucket)

Save the model

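# Save the model: either pickle xgbest (used here) or call save_model (commented out)
model_file_name = "xgboost-model"

# using save_model
# xgb_model.save_model(model_file_name)

# Use a context manager so the file is closed before it is archived
with open(model_file_name, 'wb') as f:
    pickle.dump(xgbest, f)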

!tar czvf xgboost_model.tar.gz $model_file_name

Upload to S3

key = 'xgboost_model.tar.gz'

with open('xgboost_model.tar.gz', 'rb') as f:
    client.upload_fileobj(f, bucket, key)

Import model

# Import model into hosting
container = get_image_uri(boto3.Session().region_name, "xgboost", "0.90-2")
print(container)

# Output: xxxxxx.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:0.90-2-cpu-py3

%%time

# datetime was never imported; use the strftime/gmtime helpers imported above instead
model_name = model_file_name + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_url = "https://s3-{}.amazonaws.com/{}/{}".format(region, bucket, key)

from sagemaker.xgboost import XGBoost, XGBoostModel
from sagemaker.session import Session
from sagemaker.local import LocalSession


sm_client = boto3.client(
                         "sagemaker",
                         region_name="us-west-1",
                         aws_access_key_id='xxxx',
                         aws_secret_access_key='xxxx'
                        )

# Define session
sagemaker_session = Session(sagemaker_client = sm_client)

models3_uri = "s3://ml-model/xgboost_model.tar.gz"

xgb_inference_model = XGBoostModel(
                                   model_data=models3_uri,
                                   role=role,
                                   entry_point="inference.py",
                                   framework_version="0.90-2",
                                   # Cloud
                                   sagemaker_session = sagemaker_session
                                   # Local
                                   # sagemaker_session = None
           
)

from sagemaker.serializers import CSVSerializer

predictor = xgb_inference_model.deploy(
                                       initial_instance_count = 1,
                                       # Cloud
                                       instance_type="ml.t2.large",
                                       # Local
                                       # instance_type = "local",
                                       # deploy() expects a serializer object, not a content-type string
                                       serializer = CSVSerializer()
)


if xgb_inference_model.sagemaker_session.local_mode:
    print('Deployed endpoint in local mode')
else:
    print('Deployed endpoint to SageMaker AWS Cloud')


/Applications/Anaconda/anaconda3/lib/python3.9/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   3354         if status != "InService":
   3355             reason = desc.get("FailureReason", None)
-> 3356             raise exceptions.UnexpectedStatusException(
   3357                 message="Error hosting endpoint {endpoint}: {status}. Reason: {reason}.".format(
   3358                     endpoint=endpoint, status=status, reason=reason

UnexpectedStatusException: Error hosting endpoint sagemaker-xgboost-xxxx: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

asked 2 years ago, 5589 views
1 Answer

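Please make sure that the model was trained with the same version of XGBoost that the endpoint uses at deployment. In the question, the model was trained locally with XGBoost 1.0.1, while the endpoint uses framework version 0.90-2.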
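
One way to catch such a mismatch before deploying is to compare the locally installed XGBoost version with the framework version passed to the model. The sketch below is only an illustration: the helper names `major_minor` and `versions_match` are hypothetical, and it assumes that a framework version string such as `0.90-2` starts with the XGBoost version and that comparing the major.minor numbers is a sufficient check.

```python
import re


def major_minor(version: str) -> tuple:
    """Return the leading (major, minor) numbers of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(local_version: str, framework_version: str) -> bool:
    """Compare only major.minor; any suffix such as '-2' is ignored (assumption)."""
    return major_minor(local_version) == major_minor(framework_version)


# Values taken from the question: local xgboost.__version__ vs. framework_version
print(versions_match("1.0.1", "0.90-2"))  # False
```

In a notebook, the first argument could be `xgboost.__version__` and the second the `framework_version` string given to `XGBoostModel`.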

Also verify that there are no typos in the script you use to deploy the endpoint.

I'd also check the CloudWatch logs for more information on the error. If you still cannot identify the issue, I'd recommend reaching out to AWS Support for further investigation:

Open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create
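
For the CloudWatch check mentioned above, the endpoint's container logs can also be read from Python. This is a minimal sketch, not a definitive implementation: the helper name `endpoint_log_messages` is hypothetical, it assumes the endpoint logs live in the log group `/aws/sagemaker/Endpoints/<endpoint-name>`, and it reads a single page of results from the CloudWatch Logs `filter_log_events` call. The client is passed in so that region and credentials are configured by the caller.

```python
def endpoint_log_messages(logs_client, endpoint_name, limit=50):
    """Return up to `limit` log messages for a SageMaker endpoint.

    logs_client is a CloudWatch Logs client, e.g. boto3.client("logs").
    """
    # Assumed log group naming for SageMaker endpoints
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    response = logs_client.filter_log_events(logGroupName=log_group, limit=limit)
    return [event["message"] for event in response.get("events", [])]


# Usage (requires boto3 and AWS credentials; the endpoint name is a placeholder):
#   import boto3
#   logs = boto3.client("logs", region_name="us-west-1")
#   for line in endpoint_log_messages(logs, "sagemaker-xgboost-xxxx"):
#       print(line)
```

A Python traceback in these messages, for example from loading the model file inside the container, usually points more directly at the cause than the generic ping health check message.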

AWS Support Engineer, answered 2 years ago
