How do I troubleshoot an algorithm error in my SageMaker AI processing job?
I want to troubleshoot an algorithm error that I receive when I run an Amazon SageMaker AI processing job.
Short description
When you run a SageMaker AI processing job, you might receive the "Failure reason AlgorithmError: , exit code: 1" error message for the following reasons:
- There's a Python dependency issue.
- A Python runtime error occurs, and the script exits with a non-zero exit code.
- You're using a Docker container that's built for a different CPU architecture.
Resolution
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
Identify the cause of the error
To identify the cause of the error, use Amazon CloudWatch Logs to review the stack trace log. The stack trace log shows what caused your code to fail and when the failure occurred in your processing script.
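As a rough illustration of the log review above, the following Python sketch pulls the log messages for one processing job and keeps only the last stack trace. It assumes that the job writes to the /aws/sagemaker/ProcessingJobs log group and that the log stream names start with the job name. The function names are hypothetical, and the job name and Region are values that you supply.

```python
from typing import Iterable, List

# Assumption: SageMaker AI processing jobs write their logs to this log group.
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"


def fetch_log_messages(job_name: str, region: str) -> List[str]:
    """Return all log messages for one processing job (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so that extract_traceback works without boto3

    logs = boto3.client("logs", region_name=region)
    paginator = logs.get_paginator("filter_log_events")
    messages: List[str] = []
    # Assumption: log stream names for the job begin with the job name.
    for page in paginator.paginate(logGroupName=LOG_GROUP, logStreamNamePrefix=job_name):
        messages.extend(event["message"] for event in page["events"])
    return messages


def extract_traceback(messages: Iterable[str]) -> List[str]:
    """Return the lines from the last 'Traceback' header to the end of the log."""
    lines = [line for message in messages for line in message.splitlines()]
    start = None
    for index, line in enumerate(lines):
        if line.startswith("Traceback (most recent call last):"):
            start = index  # keep the most recent traceback only
    return lines[start:] if start is not None else []
```

For example, print("\n".join(extract_traceback(fetch_log_messages(job_name, region)))) would show the final stack trace, where job_name and region are your own values.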
After you identify the cause of the error, manually run the container from a terminal. If you don't have Docker installed, then use a SageMaker AI notebook instance to run the container. Make sure to use the appropriate instance type.
Pull the container to your local environment
If you use a SageMaker AI prebuilt container, such as Scikit-learn, then log in to the registry and pull the container. To log in, run the get-login-password AWS CLI command:
aws ecr get-login-password --region $REGION | docker login -u AWS --password-stdin 683313688378.dkr.ecr.${REGION}.amazonaws.com
Note: Set the REGION shell variable to your AWS Region, for example us-east-1.
To pull the container, run the following command:
docker pull 683313688378.dkr.ecr.${REGION}.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3
Note: If you use SageMaker AI algorithms, such as Random Cut Forest, then you can't pull the container to your local environment. For these containers, contact AWS Support.
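The login and pull commands above both depend on the parts of the image URI. The following Python sketch splits an Amazon ECR image URI into the registry host (the value that docker login needs), account, Region, repository, and tag. The names parse_ecr_image_uri and lookup_sklearn_image_uri are hypothetical helpers. The lookup function assumes that the SageMaker Python SDK is installed, and the instance type that it passes is only an example. The registry account for a prebuilt container might differ between Regions, so it can be safer to look up the URI than to hardcode it.

```python
from typing import NamedTuple


class EcrImage(NamedTuple):
    registry: str    # host to pass to "docker login"
    account: str
    region: str
    repository: str
    tag: str


def parse_ecr_image_uri(image_uri: str) -> EcrImage:
    """Split '<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>' into its parts."""
    registry, _, remainder = image_uri.partition("/")
    repository, _, tag = remainder.partition(":")
    host_parts = registry.split(".")
    if len(host_parts) < 6 or host_parts[1:3] != ["dkr", "ecr"]:
        raise ValueError(f"Not an Amazon ECR image URI: {image_uri}")
    # An image URI without a tag refers to the "latest" tag.
    return EcrImage(registry, host_parts[0], host_parts[3], repository, tag or "latest")


def lookup_sklearn_image_uri(region: str) -> str:
    """Ask the SageMaker Python SDK for the prebuilt Scikit-learn image URI."""
    from sagemaker import image_uris  # lazy import: needs the sagemaker package

    # Assumption: ml.m5.xlarge is only an example CPU instance type.
    return image_uris.retrieve(
        framework="sklearn", region=region, version="1.2-1", instance_type="ml.m5.xlarge"
    )
```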
Create a bash shell in your container
To troubleshoot your code, run the container in your local environment. Create a folder that's named code, and then add your code to the folder. Mount the code folder to your container as a volume at /opt/ml/processing/input/code. Then, mount an output folder at /opt/ml/processing/output, as shown in the following example:
docker run -it \
  -v ./code:/opt/ml/processing/input/code \
  -v ./output:/opt/ml/processing/output \
  --entrypoint '/bin/bash' \
  683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3
The preceding command creates a bash shell in your container. As you follow the steps in the Troubleshoot algorithm errors section, update your code in the code folder, and then use the bash shell to run your processing script.
Troubleshoot algorithm errors
Python dependency issue
If you don't specify a Python package version, then pip installs the latest version of the package, which can cause an algorithm error. To troubleshoot this issue, identify the package versions that your code requires, and then install those versions in your processing script.
If you use a prebuilt SageMaker AI container, then find the repository for the source code. Retrieve the Python packages and versions that you use in the container. Or, for a custom container, run the following command in the bash shell to write the full list of pinned packages to a requirements.txt file:
pip freeze | grep '==' > /opt/ml/processing/input/code/requirements.txt
After you identify your package version, create a requirements.txt file in the same directory as your processing.py script. Then, add the required packages to the requirements.txt file.
Example requirements.txt file that uses the SageMaker AI Scikit-learn repository:
# Scikit-learn 1.2-1 packages
boto3==1.28.57
botocore>=1.31.57,<1.32.0
cryptography
Flask==1.1.1
itsdangerous==2.0.1
gunicorn==20.0.4
model-archiver==1.0.3
multi-model-server==1.1.1
pandas==1.1.3
protobuf==3.20.2
psutil==5.7.2
python-dateutil==2.8.1
retrying==1.3.3
sagemaker-containers==2.8.6.post2
sagemaker-inference==1.2.0
sagemaker-training==4.8.0
scikit-learn==1.2.1
scipy==1.8.0
urllib3==1.26.17
six==1.15.0
jinja2==3.0.3
MarkupSafe==2.1.1
numpy==1.24.1
gevent==23.9.1
Werkzeug==2.0.3
setuptools
wheel
certifi
# Your packages and their versions
sagemaker==2.232.1
boto3==1.34.142
botocore>=1.34.142,<1.35.0
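Because unpinned or conflicting entries are a common source of dependency errors, it can help to check the requirements.txt file before you use it. The following Python sketch reports the packages that have no exact == pin and the packages that are listed more than one time. The function name check_requirements is hypothetical, and the parsing is deliberately simple: it handles plain "name" and "name<operator>version" lines, not every requirement format that pip accepts.

```python
import re
from typing import Dict, List, Tuple

# Matches the package name at the start of a requirement line.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def check_requirements(text: str) -> Tuple[List[str], List[str]]:
    """Return (unpinned, duplicated) package names found in requirements.txt content."""
    seen: Dict[str, int] = {}
    unpinned: List[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as "-r"
            continue
        match = _NAME.match(line)
        if not match:
            continue
        name = match.group(0).lower().replace("_", "-")
        seen[name] = seen.get(name, 0) + 1
        if "==" not in line:
            unpinned.append(name)
    duplicated = [name for name, count in seen.items() if count > 1]
    return unpinned, duplicated
```

To use the sketch, pass the file content, for example check_requirements(open("requirements.txt").read()), and then pin or merge the packages that it reports.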
To install the required packages, run the pip command in your processing.py script:
import sys
import subprocess


def install(requirements: str) -> None:
    # Run pip with the interpreter that runs this script. The -q option keeps the pip output short.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-r", requirements])


def install_requirements() -> None:
    install("/opt/ml/processing/input/code/requirements.txt")


def main() -> None:
    # Import inside main() so that the packages load after the install step.
    import sagemaker
    import pandas

    example_code


if __name__ == "__main__":
    install_requirements()
    main()
Note: Replace example_code with your code. Make sure that existing packages don't update.
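One way to check that existing packages don't update is to compare your pins against the versions that are already installed in the container before you run pip. The following Python sketch uses the standard library importlib.metadata module for the lookup. The function names and the pins dictionary are hypothetical, and the sketch compares exact version strings only.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it isn't installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def packages_that_would_change(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, str]]:
    """Map each already installed package whose pin differs to (installed, pinned)."""
    changes: Dict[str, Tuple[str, str]] = {}
    for name, pinned in pins.items():
        current = lookup(name)
        # Packages that aren't installed yet are new installs, not updates.
        if current is not None and current != pinned:
            changes[name] = (current, pinned)
    return changes
```

If you run packages_that_would_change({"pandas": "1.1.3"}) in the bash shell of the container and it returns an empty dictionary, then that pin matches the installed version.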
To test whether you added the correct package versions to the processing.py script, run the following commands in the bash shell:
cd /opt/ml/processing/input/code
python3 processing.py
If the script fails, then continue to update the package versions until the script is successful.
Troubleshoot Python runtime issue
To troubleshoot a Python runtime issue, modify your code in the code folder to update your processing.py script. To test your processing.py script, run the following commands in the bash shell:
cd /opt/ml/processing/input/code
python3 processing.py
Modify your code until you resolve all the issues in the stack trace log and you receive the "0" exit code.
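To make the link between an unhandled exception and the exit code explicit, the following Python sketch wraps a main function, prints the stack trace when the function fails, and returns the exit code that the script would report. The name run_and_report is hypothetical. This is one possible pattern, not the only way to structure a processing script.

```python
import sys
import traceback
from typing import Callable


def run_and_report(main: Callable[[], None]) -> int:
    """Run main() and return the exit code that the script should report."""
    try:
        main()
    except SystemExit as exc:
        # Respect an explicit sys.exit() call inside main().
        if exc.code is None:
            return 0
        return exc.code if isinstance(exc.code, int) else 1
    except Exception:
        # Print the stack trace so that it appears in the job logs.
        traceback.print_exc()
        return 1
    return 0

# Possible use at the end of processing.py:
#     sys.exit(run_and_report(main))
```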
Troubleshoot CPU architecture issue
The container image must be built for the CPU architecture of the instance that runs your processing job. Pull the existing image, and create a new Dockerfile that includes only your container image (FROM IMAGE_URI). Then, rebuild the image for the target architecture, and push it to your Amazon Elastic Container Registry (Amazon ECR) repository.
The following example takes an image that's built for x86_64 and rebuilds the image for an ARM architecture:
#!/usr/bin/env bash

algorithm_name=ALGORITHM_NAME
sagemaker_account=SAGEMAKER_ACCOUNT_NUMBER
your_account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
fullname="${your_account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:graviton-latest"

# Log in to the source registry so that Docker can pull the base image
aws ecr get-login-password --region ${region} | docker login -u AWS --password-stdin ${sagemaker_account}.dkr.ecr.${region}.amazonaws.com

# Rebuild the image for ARM, and tag it for your repository
docker build -t ${algorithm_name} --platform=linux/arm64 .
docker tag ${algorithm_name} ${fullname}

# Create the repository if it doesn't exist
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log in to your registry, and push the image
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${fullname}
docker push ${fullname}
Note: Replace ALGORITHM_NAME and SAGEMAKER_ACCOUNT_NUMBER with your values.
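To confirm which architecture a container or host actually uses, you can check the machine type that Python reports and map it to the value that the docker build --platform option expects. The following Python sketch shows one way to do that. The mapping covers only x86_64 and ARM64 names, and the function names are hypothetical.

```python
import platform

# Machine names that platform.machine() commonly reports, mapped to Docker platform strings.
_DOCKER_PLATFORMS = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}


def docker_platform(machine: str) -> str:
    """Translate a machine name into a Docker --platform value."""
    try:
        return _DOCKER_PLATFORMS[machine.lower()]
    except KeyError:
        raise ValueError(f"Unrecognized CPU architecture: {machine}") from None


def current_platform() -> str:
    """Return the Docker platform string for the machine that runs this code."""
    return docker_platform(platform.machine())
```

If you run current_platform() in the bash shell of the container and the result doesn't match the architecture of your processing instance, then rebuild the image for the matching platform.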
Test your processing job
To test your updated processing job, run the following code:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = get_execution_role()

# Set local_mode to False to run the job on a SageMaker AI instance
local_mode = True

if local_mode:
    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1",
        role=role,
        instance_type="local",
        instance_count=1
    )
else:
    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1",
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1
    )

sklearn_processor.run(
    code='code/processing.py',
    inputs=[ProcessingInput(
        source='./code/',
        destination='/opt/ml/processing/input/code/')],
    outputs=[ProcessingOutput(
        output_name='output',
        source='/opt/ml/processing/output/')]
)
Note: To run the job as a local test, set instance_type to local or local-gpu.
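After a job runs on a SageMaker AI instance, you can confirm the result by reading the job description. The following Python sketch calls the DescribeProcessingJob API through boto3 and condenses the status, failure reason, and exit message into one line. The function names are hypothetical, and the job name and Region are values that you supply. Jobs that run in local mode aren't expected to appear in this API.

```python
from typing import Any, Dict


def summarize_job(description: Dict[str, Any]) -> str:
    """Condense a DescribeProcessingJob response into a one-line summary."""
    name = description.get("ProcessingJobName", "<unnamed>")
    status = description.get("ProcessingJobStatus", "Unknown")
    parts = [f"{name}: {status}"]
    # FailureReason and ExitMessage are present only when the job reports them.
    for key in ("FailureReason", "ExitMessage"):
        if description.get(key):
            parts.append(f"{key}: {description[key]}")
    return " | ".join(parts)


def describe_job(job_name: str, region: str) -> Dict[str, Any]:
    """Fetch the job description (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so that summarize_job works without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.describe_processing_job(ProcessingJobName=job_name)
```

For example, print(summarize_job(describe_job(job_name, region))) shows whether the job completed or still reports an algorithm error.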
