Error using SageMaker with a custom Triton container and a Hugging Face/PyTorch SageMaker pre-built Docker image


I'm using SageMaker to host a multi-container endpoint that includes a multi-model (Triton) container and a post-processing single-model container. I'm setting it up as follows:

mme_container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8",
    }
}

torch_container = {
    'Image': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04',
    'ModelDataUrl': f'{bucket_url}/post_process.tar.gz'
}
instance_type = "ml.g5.xlarge"
response = sm_client.create_model(
              ModelName        = serial_model_name,
              ExecutionRoleArn = role,
              Containers       = [mme_container,torch_container]
)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": serial_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
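Since the endpoint sits in Creating for 1-2 hours before failing, it helps to capture the final status and FailureReason programmatically instead of watching the console. A minimal polling sketch (the helper name and poll interval are my own, not from the question):

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30):
    """Poll describe_endpoint until the endpoint leaves Creating/Updating.

    Returns the terminal status and the FailureReason (empty string if the
    endpoint came up cleanly), so the root cause isn't lost when it fails.
    """
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status not in ("Creating", "Updating"):
            return status, desc.get("FailureReason", "")
        time.sleep(poll_seconds)
```

With a boto3 SageMaker client, `wait_for_endpoint(sm_client, endpoint_name)` after `create_endpoint` will surface the FailureReason string that the console sometimes truncates.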

The Dockerfile extending the pre-built SageMaker Triton container:

# SageMaker Triton image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.07-py3
# FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3

LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

ENV SAGEMAKER_MULTI_MODEL=true
ENV SAGEMAKER_BIND_TO_PORT=8080

EXPOSE 8080

RUN pip install -U pip

RUN pip install --upgrade diffusers==0.25.0 transformers==4.36.1 accelerate numpy xformers scipy omegaconf torch torchvision pytorch_lightning pynvml

RUN pip install git+https://github.com/sberbank-ai/Real-ESRGAN.git

RUN apt-get update && apt-get install ffmpeg libsm6 libxext6  -y

The errors: the endpoint stays in Creating status for about 1-2 hours, and during that time it follows this pattern:

  • There are no logs from either container_1 or container_2 for the first ~15-30 minutes
  • When the logs finally show up, they come only from container_2, all the way until the endpoint fails
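To confirm whether container_1 ever emits anything, it can help to pull the endpoint's CloudWatch streams directly rather than waiting on the console. A small sketch assuming a boto3 CloudWatch Logs client and the standard log group layout SageMaker uses for endpoints; the helper names are mine:

```python
def endpoint_log_group(endpoint_name):
    # SageMaker endpoints write per-container logs under this CloudWatch group
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"

def list_container_streams(logs_client, endpoint_name):
    """Return the endpoint's log stream names, newest activity first.

    Stream names include the variant and container identifier, so if no
    stream for container_1 appears here, that container truly never logged.
    """
    resp = logs_client.describe_log_streams(
        logGroupName=endpoint_log_group(endpoint_name),
        orderBy="LastEventTime",
        descending=True,
    )
    return [s["logStreamName"] for s in resp["logStreams"]]
```

Running `list_container_streams(boto3.client("logs"), endpoint_name)` while the endpoint is still in Creating distinguishes "container_1 is silent" from "container_1's stream was created but is empty".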

Interestingly, when using older PyTorch or Hugging Face Docker images, both containers load successfully.

We've tried various things such as:

  • Increasing the instance type to a 4xlarge
  • Adding various environment variables to the MME container, such as: 'SAGEMAKER_PROGRAM': '', 'SAGEMAKER_SUBMIT_DIRECTORY': '', 'SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT': '0.8', 'SAGEMAKER_MULTI_MODEL': 'true', 'SM_LOG_LEVEL': '10'

Through everything we've tried, the only way we've managed to get logs from container_1 was by using the following Docker images:

  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-gpu-py36-cu110-ubuntu18.04
  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.3-gpu-py3

And while those pre-built Docker images worked alongside our custom extended sagemaker-tritonserver image, they were too old to satisfy our model's requirements.

Any help as to debugging this issue would be greatly appreciated.

CS Ayo
asked 2 months ago · 47 views
No Answers
