
'InternalServerError' using SageMaker Pipelines with an extended pre-built container


I am following these docs (https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html) to run a script with my own processing container (I need to install a few custom packages) while building a SageMaker pipeline in SageMaker Studio.

Using this Dockerfile, I build and push the image to ECR and copy the image URI into the pipeline definition:

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    git \
 && rm -rf /var/lib/apt/lists/*

# Install Python dependencies directly
RUN pip install --no-cache-dir pandas openai pydub awswrangler git+https://github.com/openai/whisper.git

# Copy the rest of your application's code
# COPY . .

ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Here is the processing step in the pipeline definition:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ScriptProcessor

# Define the ScriptProcessor for the custom container
transcription_processor = ScriptProcessor(
    command=["python3"],
    image_uri=whisper_image_uri,  # the URI of my custom container in ECR
    instance_type="ml.t3.medium",
    instance_count=1,
    base_job_name="transcribe_podcasts",
    role=role,
    sagemaker_session=pipeline_session,
)

# Define processor arguments
transcription_processor_args = transcription_processor.run(
    inputs=[],
    outputs=[],
    code="code/setup.sh",  # transcribe_podcasts.py
    arguments=[
        "--show-name", show_name,
        "--xml-url", xml_url,
        "--default-bucket", default_bucket,
        "--whisper-model-size", whisper_model_size,
    ],
    wait=True,
)

transcription_step = ProcessingStep(name="TranscriptionStep", step_args=transcription_processor_args)

When I run the pipeline execution, I get the following error:

InternalServerError: InternalServerError: We encountered an internal error. Please try again. Retry not appropriate on execution of step with PipelineExecutionArn arn:aws:sagemaker:us-west-2:<accountid>:pipeline/<pipelinename>/execution/kxyxvhr8kawo and StepId TranscriptionStep. No retry policy configured for the exception type SAGEMAKER_JOB_INTERNAL_ERROR.
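The last sentence of that message refers to step-level retry policies. For reference, one could be attached to the step like this (a sketch using the SDK's retry classes; it only retries the step and does not address the root cause):

from sagemaker.workflow.retry import (
    SageMakerJobStepRetryPolicy,
    SageMakerJobExceptionTypeEnum,
)

# Retry up to 3 times, with exponential backoff, when the underlying
# processing job fails with SAGEMAKER_JOB_INTERNAL_ERROR
retry_policy = SageMakerJobStepRetryPolicy(
    failure_reason_types=[SageMakerJobExceptionTypeEnum.INTERNAL_ERROR],
    max_attempts=3,
    interval_seconds=30,
    backoff_rate=2.0,
)

transcription_step = ProcessingStep(
    name="TranscriptionStep",
    step_args=transcription_processor_args,
    retry_policies=[retry_policy],
)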

I have tried using the base image itself as the URI (763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker), and it does not give me this "InternalServerError", which suggests the error is due to the custom container I have built. Is there something I am missing?

1 Answer

Hello, thank you for reaching out. It is difficult to identify the issue without a deeper dive into the job logs to understand why the job failed with 'InternalServerError'.

Typically to extend a pre-built container in SageMaker, you need to declare the SAGEMAKER_SUBMIT_DIRECTORY and SAGEMAKER_PROGRAM environment variables. Please refer to the example Dockerfile here - https://github.com/aws/amazon-sagemaker-examples/blob/0efd885ef2a5c04929d10c5272681f4ca17dac17/advanced_functionality/pytorch_extend_container_train_deploy_bertopic/container/Dockerfile
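For illustration, applying that pattern to the Dockerfile above might look like the following (a sketch only; it assumes the entry script is code/transcribe_podcasts.py, the name mentioned in the question):

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker

# ... system and Python package installation steps as in the question ...

# Tell the SageMaker toolkit where the code lives and which script to run
# (assumption: the entry script is transcribe_podcasts.py from the question)
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
COPY code/transcribe_podcasts.py /opt/ml/code/transcribe_podcasts.py
ENV SAGEMAKER_PROGRAM transcribe_podcasts.py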

You can also test the container image in local mode to check that it works as expected before running it in a SageMaker job. If the issue persists, I would recommend opening an AWS support case with the job ARN and associated logs for further troubleshooting - https://console.aws.amazon.com/support/home?#/case/create
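For the local-mode test, a minimal sketch might look like this (it assumes Docker is available and the sagemaker[local] extra is installed, and it reuses the variable names and script path from the question):

from sagemaker.local import LocalSession
from sagemaker.processing import ScriptProcessor

# Local mode runs the container on this machine via Docker instead of
# launching a remote SageMaker job, so failures surface directly in the console
local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

local_processor = ScriptProcessor(
    command=["python3"],
    image_uri=whisper_image_uri,  # the same custom image URI as in the question
    instance_type="local",        # use "local_gpu" to exercise a GPU image
    instance_count=1,
    role=role,
    sagemaker_session=local_session,
)

# Assumption: the entry script path taken from the question
local_processor.run(code="code/transcribe_podcasts.py")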

AWS
answered 6 months ago
