I am following these docs (https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html) to run a script with my own processing container (I need to install a few custom packages) while building a SageMaker pipeline in SageMaker Studio.
Using this Dockerfile, I build the image, push it to ECR, and copy the image URI into the pipeline definition:
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker
# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    git \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies directly
RUN pip install --no-cache-dir pandas openai pydub awswrangler git+https://github.com/openai/whisper.git
# Copy the rest of your application's code
# COPY . .
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]
Here is the step in the pipeline definition:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ScriptProcessor
framework_version = "1.2-1"
# Define the ScriptProcessor for your custom container
transcription_processor = ScriptProcessor(
    command=["python3"],
    image_uri=whisper_image_uri,  # The URI to my custom container in ECR
    instance_type="ml.t3.medium",
    instance_count=1,
    base_job_name="transcribe_podcasts",
    role=role,
    sagemaker_session=pipeline_session,
)
# Define processor arguments
transcription_processor_args = transcription_processor.run(
    inputs=[],
    outputs=[],
    code="code/setup.sh",  # transcribe_podcasts.py
    arguments=[
        "--show-name", show_name,
        "--xml-url", xml_url,
        "--default-bucket", default_bucket,
        "--whisper-model-size", whisper_model_size,
    ],
    wait=True,
)
transcription_step = ProcessingStep(name="TranscriptionStep", step_args=transcription_processor_args)
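For reference, `ScriptProcessor` builds the container entrypoint from `command` followed by the path of the uploaded `code` file, so the configuration above would ask `python3` to run `setup.sh`. This may or may not be related to the error described below. A minimal sketch of a local check for that kind of mismatch follows; the helper name `check_command_matches_script` and the extension table are hypothetical and not part of the SageMaker SDK:

```python
from pathlib import PurePosixPath

# Hypothetical table: script extension -> interpreter name prefixes that can run it.
EXPECTED_INTERPRETERS = {
    ".py": ("python",),
    ".sh": ("bash", "sh"),
}


def check_command_matches_script(command, code):
    """Return a warning string if the interpreter looks wrong for the script, else None."""
    interpreter = PurePosixPath(command[0]).name
    suffix = PurePosixPath(code).suffix
    expected = EXPECTED_INTERPRETERS.get(suffix)
    if expected is None:
        return None  # unknown extension, nothing to check
    if not any(interpreter.startswith(prefix) for prefix in expected):
        return (
            f"{interpreter!r} would be asked to run {code!r}; "
            f"expected an interpreter starting with one of {expected}"
        )
    return None


# Values taken from the step definition above.
print(check_command_matches_script(["python3"], "code/setup.sh"))
```

With the commented-out alternative `code/transcribe_podcasts.py`, the same check returns `None`.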
When I execute the pipeline, I get the following error:
InternalServerError: InternalServerError: We encountered an internal error. Please try again. Retry not appropriate on execution of step with PipelineExecutionArn arn:aws:sagemaker:us-west-2:<accountid>:pipeline/<pipelinename>/execution/kxyxvhr8kawo and StepId TranscriptionStep. No retry policy configured for the exception type SAGEMAKER_JOB_INTERNAL_ERROR.
I have tried using the base image directly as the image URI (763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker), and it does not produce this "InternalServerError", which suggests to me that the error is caused by the custom container I have built. Is there something that I am missing?
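Since the pipeline-level message is generic, one way to look for more detail might be to list the execution's steps with boto3's `list_pipeline_execution_steps` and read each step's `FailureReason`. The sketch below only shows how such a response could be summarized; the helper `summarize_failed_steps` is hypothetical, and `sample_response` is a hand-written placeholder in the shape of that API's response, not real output from my account:

```python
def summarize_failed_steps(response):
    """Collect (step name, failure reason) pairs for every failed step."""
    failed = []
    for step in response.get("PipelineExecutionSteps", []):
        if step.get("StepStatus") == "Failed":
            reason = step.get("FailureReason", "<no reason reported>")
            failed.append((step.get("StepName"), reason))
    return failed


# To fetch a real response (assumes AWS credentials and the execution ARN):
#   import boto3
#   sm = boto3.client("sagemaker")
#   response = sm.list_pipeline_execution_steps(PipelineExecutionArn="<execution arn>")

# Hand-written placeholder in the same shape, for illustration only.
sample_response = {
    "PipelineExecutionSteps": [
        {
            "StepName": "TranscriptionStep",
            "StepStatus": "Failed",
            "FailureReason": "<failure reason text>",
        }
    ]
}

for name, reason in summarize_failed_steps(sample_response):
    print(f"{name}: {reason}")
```

If the step did start a processing job, the step's `Metadata` should also carry the job ARN, which could then be inspected with `describe_processing_job` and in the job's CloudWatch logs.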