How do I troubleshoot an algorithm error in my SageMaker AI processing job?

I want to troubleshoot an algorithm error that I receive when I run an Amazon SageMaker AI processing job.

Short description

When you run a SageMaker AI processing job, you might receive the "Failure reason AlgorithmError: , exit code: 1" error message for the following reasons:

  • There's a Python dependency issue.
  • A Python runtime error occurs, and you receive a non-zero exit code.
  • You're using a Docker container that's built for a different CPU architecture.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Identify the cause of the error

To identify the cause of the error, use Amazon CloudWatch Logs to review the stack trace log. The stack trace log shows what caused your code to fail and when the failure occurred in your processing script.
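
The log review can also be scripted. The following is a minimal sketch, assuming that the job writes to the /aws/sagemaker/ProcessingJobs log group and that its log stream names start with the job name followed by a slash. The function names fetch_job_log_lines and extract_last_traceback are hypothetical helpers, not part of any AWS library.

```python
from typing import List

# Assumption: log group that SageMaker AI processing jobs write to.
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"


def fetch_job_log_lines(logs_client, job_name: str) -> List[str]:
    """Return every log message for one processing job, following pagination."""
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=LOG_GROUP,
        # Assumption: log streams for a job are named "<job name>/<host>-<timestamp>".
        logStreamNamePrefix=f"{job_name}/",
    )
    return [event["message"] for page in pages for event in page["events"]]


def extract_last_traceback(lines: List[str]) -> List[str]:
    """Return the lines from the last Python traceback header to the end."""
    start = None
    for index, line in enumerate(lines):
        if "Traceback (most recent call last):" in line:
            start = index
    return [] if start is None else lines[start:]
```

To use the sketch, pass a CloudWatch Logs client, for example boto3.client("logs"), and the name of your processing job to fetch_job_log_lines, and then pass the result to extract_last_traceback.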

After you identify the cause of the error, manually run the container from a terminal. If you don't have Docker installed, then use a SageMaker AI notebook instance to run the container. Make sure to use the appropriate instance type.

Pull the container to your local environment

If you use a SageMaker AI prebuilt container, such as Scikit-learn, then log in to the registry and pull the container. To log in, run the get-login-password AWS CLI command:

aws ecr get-login-password --region $REGION | docker login -u AWS --password-stdin 683313688378.dkr.ecr.${REGION}.amazonaws.com

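Note: Set the REGION environment variable to your AWS Region before you run the command. The registry account ID in this example, 683313688378, doesn't apply to every Region, so use the account ID for your Region.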

To pull the container, run the following command:

docker pull 683313688378.dkr.ecr.${REGION}.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3

Note: If you use SageMaker AI algorithms, such as Random Cut Forest, then you can't pull the container to your local environment. For these containers, contact AWS Support.
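
The login command targets the registry host, and the pull command targets the full image URI. The following sketch shows how the two relate. The helper names registry_from_image_uri and region_from_registry are hypothetical and exist only for illustration.

```python
def registry_from_image_uri(image_uri: str) -> str:
    """Return the registry host, which is the part before the first slash."""
    registry, separator, _ = image_uri.partition("/")
    if not separator or ".dkr.ecr." not in registry:
        raise ValueError(f"Not an Amazon ECR image URI: {image_uri}")
    return registry


def region_from_registry(registry: str) -> str:
    """Return the Region from a host of the form <account>.dkr.ecr.<region>.amazonaws.com."""
    return registry.split(".")[3]
```

For the image in the preceding pull command, the registry host is everything before /sagemaker-scikit-learn, and that host is the value to pass to docker login.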

Create a bash shell in your container

To troubleshoot your code, run the container in your local environment. Create a folder that's named code, and then add your code to the folder. Create a second folder that's named output. Then, mount both folders to the container as volumes, as shown in the following example:

docker run -it \
    -v "$(pwd)/code":/opt/ml/processing/input/code \
    -v "$(pwd)/output":/opt/ml/processing/output \
    --entrypoint '/bin/bash' \
    683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3

The preceding command creates a bash shell in your container. As you work through the Troubleshoot algorithm errors section, update your code in the code folder, and then use the bash shell to run your processing script.

Troubleshoot algorithm errors

Python dependency issue

If you don't specify a Python package version, then pip installs the latest version of the package, and that version can cause an algorithm error. To troubleshoot this issue, identify the package versions that you need, and then install those versions from your processing script.

If you use a prebuilt SageMaker AI container, then find the repository for the source code, and retrieve the Python packages and versions that the container uses. For a custom container, run the following command in the bash shell to see the full list of packages:

pip freeze | grep '==' > /opt/ml/processing/input/code/requirements.txt

After you identify your package version, create a requirements.txt file in the same directory as your processing.py script. Then, add the required packages to the requirements.txt file.

Example requirements.txt file that uses the SageMaker AI Scikit-learn repository:

# Scikit-learn 1.2-1 packages
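# boto3 and botocore are pinned once, in the "Your packages" section of this file,
# because pip can't install two different pins for the same package.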
cryptography
Flask==1.1.1
itsdangerous==2.0.1
gunicorn==20.0.4
model-archiver==1.0.3
multi-model-server==1.1.1
pandas==1.1.3
protobuf==3.20.2
psutil==5.7.2
python-dateutil==2.8.1
retrying==1.3.3
sagemaker-containers==2.8.6.post2
sagemaker-inference==1.2.0
sagemaker-training==4.8.0
scikit-learn==1.2.1
scipy==1.8.0
urllib3==1.26.17
six==1.15.0
jinja2==3.0.3
MarkupSafe==2.1.1
numpy==1.24.1
gevent==23.9.1
Werkzeug==2.0.3
setuptools
wheel
certifi

# Your packages and their versions
sagemaker==2.232.1
boto3==1.34.142
botocore>=1.34.142,<1.35.0
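
Before you install from the file, it can help to confirm that no package is pinned to two different versions, because pip can't satisfy both pins. The following is a minimal sketch that handles only simple name==version lines. The function names parse_exact_pins and conflicting_pins are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Matches simple "name==version" lines. Comments and range pins don't match.
_PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;#]+)")


def parse_exact_pins(text: str) -> Dict[str, Set[str]]:
    """Map each normalized package name to the set of versions pinned with ==."""
    pins = defaultdict(set)
    for line in text.splitlines():
        match = _PIN.match(line)
        if match:
            # Package names are case-insensitive, and "-", "_", "." are equivalent.
            name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
            pins[name].add(match.group(2))
    return dict(pins)


def conflicting_pins(text: str) -> Dict[str, List[str]]:
    """Return the packages that are pinned to more than one version."""
    return {
        name: sorted(versions)
        for name, versions in parse_exact_pins(text).items()
        if len(versions) > 1
    }
```

To use the sketch, read your requirements.txt file into a string and pass it to conflicting_pins. An empty result means that no package has two different exact pins.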

To install the required packages, call pip from your processing.py script before the script imports the packages:

import sys
import subprocess


def install(requirements: str) -> None:
    # Run pip with the interpreter that runs this script. The -q flag keeps pip's output quiet.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-r", requirements])
    

def install_requirements() -> None:
    install("/opt/ml/processing/input/code/requirements.txt")
    
    
def main() -> None:
    import sagemaker
    import pandas
    example_code


if __name__ == "__main__":
    install_requirements()
    main()

Note: Replace example_code with your code. Make sure that the installation doesn't update packages that already exist in the container.

To test whether you added the correct package versions, run the following commands in the bash shell:

cd /opt/ml/processing/input/code
python3 processing.py

If the script fails, then continue to update the package versions until the script is successful.
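
To see which installed versions differ from your pins after the installation, you can compare them from Python. The following is a minimal sketch that uses the standard library's importlib.metadata module. The function names installed_version and find_mismatches are hypothetical, and the pins dictionary is a value that you supply.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it isn't installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (pinned version, installed version)} for every difference."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

For example, find_mismatches({"pandas": "1.1.3"}) returns an empty dictionary when the container has pandas 1.1.3 installed.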

Python runtime issue

To troubleshoot a Python runtime issue, modify your code in the code folder to update your processing.py script. To test your processing.py script, run the following commands in the bash shell:

cd /opt/ml/processing/input/code
python3 processing.py

Modify your code until you resolve all the issues in the stack trace log and the script exits with exit code 0.
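
An uncaught Python exception makes the interpreter exit with code 1, which is the exit code that the AlgorithmError message reports. The following is a minimal sketch that runs a script in a child process and returns the exit code together with the error output. The function name run_script is hypothetical.

```python
import subprocess
import sys
from typing import Tuple


def run_script(path: str) -> Tuple[int, str]:
    """Run a Python script in a child process and return (exit code, stderr text)."""
    completed = subprocess.run(
        [sys.executable, path],
        capture_output=True,  # Collect stdout and stderr instead of printing them.
        text=True,            # Decode the output as text.
    )
    return completed.returncode, completed.stderr
```

In the container's bash shell, you can call run_script("/opt/ml/processing/input/code/processing.py") from a Python session and inspect the returned stack trace when the exit code isn't 0.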

CPU architecture issue

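The container image must be built for the CPU architecture of the instance type that runs your processing job. To rebuild an image for a different architecture, pull the existing image, and then create a new Dockerfile that includes only your container image (FROM IMAGE_URI). Then, build the image for the target architecture, and push the image to your existing Amazon Elastic Container Registry (Amazon ECR) repository.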

The following example takes an image that's built for x86_64 and rebuilds the image for an ARM architecture:

#!/usr/bin/env bash
algorithm_name=ALGORITHM_NAME
sagemaker_account=SAGEMAKER_ACCOUNT_NUMBER
your_account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
your_registry="${your_account}.dkr.ecr.${region}.amazonaws.com"
fullname="${your_registry}/${algorithm_name}:graviton-latest"

# Log in to the SageMaker AI registry so that docker build can pull the base image.
aws ecr get-login-password --region "${region}" | docker login -u AWS --password-stdin "${sagemaker_account}.dkr.ecr.${region}.amazonaws.com"

# Rebuild the image for ARM, and then tag the image for your repository.
docker build -t "${algorithm_name}" --platform=linux/arm64 .
docker tag "${algorithm_name}" "${fullname}"

# Create the repository in your account if it doesn't exist.
if ! aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log in to your own registry (the registry host, not the full image name), and then push the image.
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${your_registry}"
docker push "${fullname}"

Note: Replace ALGORITHM_NAME and SAGEMAKER_ACCOUNT_NUMBER with your values.
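
To check whether an image matches the machine that runs it, you can compare architecture names. Docker reports an image's architecture as amd64 or arm64 (for example, through docker image inspect), but operating systems often report x86_64 or aarch64 for the same hardware. The following is a minimal sketch that normalizes both spellings. The function names normalize_arch and image_matches_host are hypothetical, and the alias table covers only these two architectures.

```python
import platform

# Assumption: only these two architectures matter for the comparison.
_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name: str) -> str:
    """Map an operating system or Docker architecture name to Docker's spelling."""
    key = name.strip().lower()
    if key not in _ALIASES:
        raise ValueError(f"Unknown architecture: {name}")
    return _ALIASES[key]


def image_matches_host(image_arch: str, host_arch: str = "") -> bool:
    """Return True if the image architecture matches the host architecture."""
    # Fall back to the current machine when the caller doesn't pass a host value.
    host = host_arch or platform.machine()
    return normalize_arch(image_arch) == normalize_arch(host)
```

For example, image_matches_host("arm64", "x86_64") returns False, which indicates that the image needs a rebuild for that host.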

Test your processing job

To test your updated processing job, run the following Python code:

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput


role = get_execution_role()

local_mode = True

if local_mode:
    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1", role=role, instance_type="local", instance_count=1
    )
else:
    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1", role=role, instance_type="ml.m5.xlarge", instance_count=1
    )
                             
sklearn_processor.run(code='code/processing.py',
                      inputs=[ProcessingInput(
                          source='./code/',
                          destination='/opt/ml/processing/input/code/')],
                      outputs=[ProcessingOutput(
                          output_name='output',
                          source='/opt/ml/processing/output/')]
                     )

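Note: To run the job on your local machine as a test, set instance_type to local or local_gpu. In the preceding example, the local_mode variable selects the local instance type.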
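
When you run the job on a SageMaker AI instance instead of in local mode, you can read the job's status and failure reason from the DescribeProcessingJob API. The following is a minimal sketch. The function name summarize_failure is hypothetical, and the job name is a value that you supply.

```python
from typing import Dict, Optional


def summarize_failure(sagemaker_client, job_name: str) -> Dict[str, Optional[str]]:
    """Return the status, failure reason, and exit message of a processing job."""
    response = sagemaker_client.describe_processing_job(ProcessingJobName=job_name)
    return {
        "status": response["ProcessingJobStatus"],
        # These keys are present only when the job reports a failure or an exit message.
        "failure_reason": response.get("FailureReason"),
        "exit_message": response.get("ExitMessage"),
    }
```

To use the sketch, pass a SageMaker client, for example boto3.client("sagemaker"), and the name of your processing job.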

AWS OFFICIAL
Updated 20 days ago