Error Creating Endpoint


Hi! The following error occurs when I try to create an endpoint from a successfully trained model:

  • In the web console:

The customer:primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

  • CloudWatch logs:

exec: "serve": executable file not found in $PATH

I'm deploying the model using a Lambda step, just as in this notebook. The Lambda step succeeds, and I can see in the AWS web console that the model configuration is created successfully.

The exact same error happens when I create an endpoint for the registered model in the AWS web console, under Inference -> Models. In the console I can see that an inference container was created for the model, with the following characteristics:

  • Image: 763104351884.dkr.ecr.eu-west-3.amazonaws.com/tensorflow-training:2.8-cpu-py39
  • Mode: single model
  • Environment variables:

SAGEMAKER_CONTAINER_LOG_LEVEL: 20
SAGEMAKER_PROGRAM: inference.py
SAGEMAKER_REGION: eu-west-3
SAGEMAKER_SUBMIT_DIRECTORY: /opt/ml/model/code

I have absolutely no clue what is wrong, and I could not find anything relevant online about this problem. Is it necessary to provide a custom Docker image for inference, or something like that?

For more details, please find the pipeline's model step code below. Any help would be much appreciated!

from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep

# Generic Model built from the training image and the training step's artifacts
model = Model(
    image_uri=estimator.training_image_uri(),
    model_data=step_training.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
    role=sagemaker_role,
    source_dir='code',
    entry_point='inference.py',
)
step_model_create = ModelStep(
    name="CreateModelStep",
    step_args=model.create(instance_type="ml.m5.large"),
)

register_args = model.register(
    content_types=["*"],
    response_types=["application/json"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="test",
    approval_status="Approved",
)
step_model_register = ModelStep(name="RegisterModelStep", step_args=register_args)
1 Answer
Accepted Answer

Hi, the problem here is that your inference model's container URI 763104351884.dkr.ecr.eu-west-3.amazonaws.com/tensorflow-training:2.8-cpu-py39 points to a training image, not an inference image for TensorFlow. Because each image is optimized for its own function, the serving executable is not available in the training container, which is exactly why the ping health check fails with exec: "serve": executable file not found in $PATH.
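
If it helps, here's a minimal sketch (using the region, version, and instance type from your question; adjust to your setup) of how the SageMaker Python SDK's image_uris.retrieve() can look up the matching inference image for the same framework:

from sagemaker import image_uris

# Look up the TensorFlow *inference* image for the same version/region
inference_image_uri = image_uris.retrieve(
    framework="tensorflow",
    region="eu-west-3",
    version="2.8",
    image_scope="inference",  # the key difference: "inference", not "training"
    instance_type="ml.m5.large",
)
print(inference_image_uri)
# Should print something like
# 763104351884.dkr.ecr.eu-west-3.amazonaws.com/tensorflow-inference:2.8-cpu
# (the exact repository name and tag may vary)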

Usually, the framework-specific SDK classes will handle this lookup for you (for example TensorFlowModel(...) as used in the notebook you linked, or sagemaker.tensorflow.TensorFlow.deploy(...) called from the Estimator class).
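
For illustration, a rough sketch of what that could look like with the variable names from your question (assuming TensorFlow 2.8; treat this as a starting point, not a drop-in fix):

from sagemaker.tensorflow import TensorFlowModel

# TensorFlowModel resolves the correct TensorFlow *inference* container
# automatically from framework_version, so no image_uri is needed
model = TensorFlowModel(
    model_data=step_training.properties.ModelArtifacts.S3ModelArtifacts,
    role=sagemaker_role,
    framework_version="2.8",
    entry_point="inference.py",
    source_dir="code",
    sagemaker_session=sagemaker_session,
)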

I see here, though, that you're using the generic Model class, so I guess you don't know (or don't want to commit to) the framework and version at the point the Lambda function runs?

My suggestions would be:

  • Can you use the Pipelines ModelStep to create your model before calling the Lambda deployment function, similarly to how your linked notebook uses CreateModelStep? This would build your framework & version into the pipeline definition itself, but it should mean that the inference container image gets selected properly & automatically.
  • If you really need to be dynamic, I think you'll need a way of looking up at least the framework from the training job. From my testing, you can use estimator = sagemaker.tensorflow.TensorFlow.attach("training-job-name") and then model = estimator.create_model(...) to correctly infer the specific inference container version from a training job, but this still relies on knowing that TensorFlow is the correct framework; I'm not aware of a framework-agnostic equivalent. So you could, for example, describe the training job, manually infer which framework it uses from that information, and then use the relevant framework estimator class's attach() method to figure out the specifics and create your model (see the sketch after this list).
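
A rough sketch of that second, dynamic option (assuming your Lambda receives the training job name, e.g. as an event field; the names here are hypothetical):

import boto3
from sagemaker.tensorflow import TensorFlow

training_job_name = event["training_job_name"]  # hypothetical Lambda input

# Inspect the job to work out which framework trained it: the training
# image URI in AlgorithmSpecification hints at the framework
desc = boto3.client("sagemaker").describe_training_job(
    TrainingJobName=training_job_name
)
training_image = desc["AlgorithmSpecification"]["TrainingImage"]

# Once you've established it's TensorFlow, attach() recovers the framework
# version, and create_model() picks the matching inference container
if "tensorflow" in training_image:
    estimator = TensorFlow.attach(training_job_name)
    model = estimator.create_model(entry_point="inference.py", source_dir="code")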
Alex_T (AWS EXPERT), answered a year ago
