SageMaker training job successful but model not uploaded to S3

OK, I've been dealing with this issue in SageMaker for almost a week and I'm ready to pull my hair out. I've got a custom training script paired with a data processing script in a BYO-algorithm Docker deployment scenario. It's a PyTorch model built with Python 3.x, and the BYO Dockerfile was originally written for Python 2, but I can't see how that relates to the problem I'm having, which is that after a successful training run SageMaker doesn't save the model to the target S3 bucket.

I've searched far and wide and can't find an applicable answer anywhere. This is all done inside a Notebook instance. Note: I'm working on this as a contractor and don't have full permissions to the rest of AWS, including downloading the Docker image.

Dockerfile:

FROM ubuntu:18.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python-pip \
         python3-pip \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
    pip3 install future numpy torch scipy scikit-learn pandas flask gevent gunicorn && \
    rm -rf /root/.cache

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY decision_trees /opt/program
WORKDIR /opt/program

Docker Image Build:

%%sh

algorithm_name="name-this-algo"

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

account=$(aws sts get-caller-identity --query Account --output text)

region=$(aws configure get region)
region=${region:-us-east-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Env setup and session start:

common_prefix = "pytorch-lstm"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"

import os
from sagemaker import get_execution_role
import sagemaker as sage

sess = sage.Session()

role = get_execution_role()
print(role)

Training Directory, Image, and Estimator Setup, then a fit call:

TRAINING_WORKDIR = "a/local/directory"

training_input = sess.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print ("Training Data Location " + training_input)

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/image-that-works:working'.format(account, region)

tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.p2.xlarge',
                                output_path="s3://sagemaker-directory-that-definitely/exists",
                                sagemaker_session=sess)

tree.fit(training_input)

The above script definitely works. I have print statements in my training script, and they print the expected results to the console. The job runs as it's supposed to, finishes up, and says that it's uploading model artifacts when IT DEFINITELY DOES NOT.

Model Deployment:

model = tree.create_model()
predictor = tree.deploy(1, 'ml.m4.xlarge')

This throws an error saying the model can't be found. A call to aws sagemaker describe-training-job shows that the training completed, but the time it took to upload the model was suspiciously fast, so there's obviously an error somewhere that isn't being surfaced. Hopefully it's not just uploading the model into the aether.

{
    "Status": "Uploading",
    "StartTime": 1595982984.068,
    "EndTime": 1595982989.994,
    "StatusMessage": "Uploading generated training model"
},
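
For reference, the snippet above is part of the SecondaryStatusTransitions list returned by a call along these lines (the job name is a placeholder for my actual training job):

aws sagemaker describe-training-job \
    --training-job-name my-training-job-name \
    --query 'SecondaryStatusTransitions'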

Here's what I've tried so far:

  1. I've tried uploading it to a different bucket. I figured my permissions were the problem, so I pointed it at a bucket that I knew I could upload to, since I had done so before. No dice.
  2. I tried backporting the script to Python 2.x, but that caused more problems than it probably would have solved, and I don't really see how that would be the problem anyway.
  3. I made sure the Notebook's IAM role has sufficient permissions; it does have the SageMakerFullAccess policy attached.

What bothers me is that there's no error log I can see. If someone could point me to one I'd be happy, and if there's some hidden SageMaker kung fu that I don't know about, I'd be forever grateful.

asked 4 years ago · 1,814 views
2 Answers

I fixed it! My training script was not actually saving the model to the proper folder, /opt/ml/model.
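
For anyone hitting the same thing, the fix boiled down to writing the artifacts into /opt/ml/model before the training script exits. Here is a minimal sketch of what that looks like (the LSTM below is just a stand-in for your trained network, and the file name is arbitrary):

import os

import torch
import torch.nn as nn

# SageMaker tars up everything written under /opt/ml/model and uploads it to
# the estimator's output_path as model.tar.gz once the training job completes.
MODEL_DIR = "/opt/ml/model"
os.makedirs(MODEL_DIR, exist_ok=True)

# Stand-in for the trained network produced by the training loop.
model = nn.LSTM(input_size=10, hidden_size=20)

torch.save(model.state_dict(), os.path.join(MODEL_DIR, "model.pth"))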

answered 4 years ago

Hello,

As mentioned here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html, your algorithm must write the final model to /opt/ml/model so that SageMaker can upload it to S3 as a single object in compressed tar format.
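
Once the artifacts are written there, SageMaker packages the contents of /opt/ml/model as model.tar.gz under the estimator's output_path. You can verify what actually got uploaded with something like the following (the bucket and job name are placeholders):

aws s3 ls s3://your-output-bucket/your-training-job-name/output/
aws s3 cp s3://your-output-bucket/your-training-job-name/output/model.tar.gz .
tar -tzf model.tar.gz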

AWS
answered 4 years ago
