Extending Docker image for SageMaker Inference


I'm trying to create my own Docker image for use with SageMaker Batch Transform by extending an existing one. Following the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html, I have created the following Dockerfile to run Detectron2:

FROM 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-inference:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker

############# Installing latest builds ############
RUN pip install --upgrade torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html

ENV FORCE_CUDA="1"
# Build D2 only for Turing (G4) and Volta (P3) architectures. Use P3 for batch transforms and G4 for inference on endpoints
ENV TORCH_CUDA_ARCH_LIST="Turing;Volta"

# Install Detectron2
RUN pip install --no-cache-dir \
   pycocotools~=2.0.0 \
   https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/detectron2-0.6%2Bcu113-cp38-cp38-linux_x86_64.whl
   
# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"

############# SageMaker section ##############

ENV PATH="/opt/ml/code:${PATH}"

COPY inference.py /opt/ml/code/inference.py

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM inference.py
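
For context, those last two environment variables tell the SageMaker PyTorch inference toolkit which script to load; the script implements the toolkit's model_fn/input_fn/predict_fn/output_fn hooks. A trimmed-down sketch of my inference.py (the weights and config file names inside model.tar.gz are placeholders, not my actual ones):

# Minimal inference.py sketch for the SageMaker PyTorch inference toolkit.
# Hook names (model_fn, input_fn, predict_fn, output_fn) are the toolkit's
# conventions; the weights/config file names below are assumptions.
import io
import json

import numpy as np
import torch
from PIL import Image
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

def model_fn(model_dir):
    # Called once at container start; model_dir is the unpacked model.tar.gz.
    cfg = get_cfg()
    cfg.merge_from_file(f"{model_dir}/config.yaml")     # assumed file name
    cfg.MODEL.WEIGHTS = f"{model_dir}/model_final.pth"  # assumed file name
    cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    return DefaultPredictor(cfg)

def input_fn(request_body, content_type):
    # Batch transform sends the raw bytes of each image (application/x-image).
    image = Image.open(io.BytesIO(request_body)).convert("RGB")
    return np.asarray(image)[:, :, ::-1]  # RGB -> BGR, Detectron2's default

def predict_fn(input_data, predictor):
    return predictor(input_data)

def output_fn(prediction, accept):
    instances = prediction["instances"].to("cpu")
    return json.dumps({
        "boxes": instances.pred_boxes.tensor.tolist(),
        "scores": instances.scores.tolist(),
        "classes": instances.pred_classes.tolist(),
    })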

I then create a model (create-model) with this image using the following configuration:

{
  "ExecutionRoleArn": "arn:aws:iam::[redacted]:role/model-role",
  "ModelName": "model-test",
  "PrimaryContainer": {
    "Environment": {
      "SAGEMAKER_PROGRAM": "inference.py",
      "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/code",
      "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
      "SAGEMAKER_REGION": "eu-west-2",
      "MMS_DEFAULT_RESPONSE_TIMEOUT": "500"
    },
    "Image": "[redacted].dkr.ecr.eu-west-2.amazonaws.com/my-image:latest",
    "ModelDataUrl": "s3://[redacted]/training/output/model.tar.gz"
  }
}
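
This is submitted with the AWS CLI, roughly as follows (the JSON file name is just where I saved the configuration above):

aws sagemaker create-model --cli-input-json file://create-model.json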

And submit a batch transform job (create-transform-job) using the following configuration:

{
  "MaxPayloadInMB": 16,
  "ModelName": "model-test",
  "TransformInput": {
    "ContentType": "application/x-image",
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "ManifestFile",
        "S3Uri": "s3://[redacted]/manifests/input.manifest"
      }
    }
  },
  "TransformJobName": "transform-test",
  "TransformOutput": {
    "S3OutputPath": "s3://[redacted]/predictions/"
  },
  "TransformResources": {
    "InstanceCount": 1,
    "InstanceType": "ml.m5.large"
  }
}
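
Again submitted via the CLI (file name illustrative):

aws sagemaker create-transform-job --cli-input-json file://create-transform-job.json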

Both of the above commands succeed, but the transform job never completes. The errors in the logs indicate that the container is not using my inference script (inference.py, specified above) but the default handler (default_pytorch_inference_handler.py) instead, and therefore cannot find the model.

What am I missing so that it uses my inference script instead, and hence my model?

  • Can you test by including the inference.py script in the model tarball instead of baking it into the image? (See the packaging sketch after these comments.)

  • I'm having the same problem now. In the CloudWatch logs, it tried to use default_pytorch_inference_handler.py instead of my inference.py. Did you manage to solve the problem?
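
For anyone trying the suggestion in the first comment above, a sketch of packaging the script into the model artifact instead (the SageMaker PyTorch inference toolkit picks up code/inference.py from the unpacked model.tar.gz; the weights file name here is an assumption):

# Hypothetical repackaging; assumes the trained weights are in model_final.pth
mkdir -p code && cp inference.py code/
tar czf model.tar.gz model_final.pth code/
# Unpacked layout the toolkit expects:
#   model_final.pth
#   code/inference.py   <- loaded as the entry point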

jdbaker
Asked 2 years ago · 1,489 views
1 answer

Hello,

Thank you for using AWS SageMaker.

It is difficult to identify why this behavior occurs without the logs for the task in your account. However, looking at the snippets you shared, I can see that the Docker image you are extending is a GPU build ("pytorch-inference:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker"), while the batch transform job was created with a CPU instance type ("InstanceType": "ml.m5.large").

I'd recommend fixing that configuration and running the batch transform job again. If you still observe a similar issue, please reach out to AWS Support for further investigation with all the details and logs, as sharing logs on this platform is not recommended.
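
For example, you could switch the transform resources to a GPU instance type that matches the CUDA 11.3 build (the exact instance size below is only an illustration):

"TransformResources": {
    "InstanceCount": 1,
    "InstanceType": "ml.p3.2xlarge"
}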

Open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
Support Engineer
Answered 2 years ago
  • Thanks, I'll take a look at that and see if it makes a difference, although the issue seems to be less about the inference itself and more about getting the extended image to use my script instead of the default one.

    I do also have a support case open already - just hoping to get some other views/support too in order to get the issue resolved as soon as possible.
