Waiter TrainingJobCompletedOrStopped failed: Waiter encountered a terminal

I am trying to launch a training job using the low-level API of the boto3 SageMaker client. After calling sagemaker.create_training_job(**params), I try to get a waiter. The code comes directly from the documentation for creating a training job (https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html).
I get this error:

Traceback (most recent call last):
  File "traindeploy.py", line 97, in create_training_job
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
  File "/path/to/lib/Python/3.6/lib/python/site-packages/botocore/waiter.py", line 53, in wait
    Waiter.wait(self, **kwargs)
  File "/path/to/lib/Python/3.6/lib/python/site-packages/botocore/waiter.py", line 323, in wait
    last_response=response,
botocore.exceptions.WaiterError: Waiter TrainingJobCompletedOrStopped failed: Waiter encountered a terminal failure state

These are my job params:

{
  "AlgorithmSpecification": {
    "TrainingImage": "<image-url-from-ecr>",
    "TrainingInputMode": "File"
  },
  "RoleArn": "<role-arn>",
  "OutputDataConfig": {
    "S3OutputPath": "s3://path-to-bucket/some-folder-output/"
  },
  "ResourceConfig": {
    "InstanceCount": 2,
    "InstanceType": "ml.c4.8xlarge",
    "VolumeSizeInGB": 50
  },
  "TrainingJobName": "some-jobname",
  "HyperParameters": {},
  "StoppingCondition": {
    "MaxRuntimeInSeconds": 3600
  },
  "InputDataConfig": [
    {
      "ChannelName": "train",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://path-to-bucket/some-folder-input/",
          "S3DataDistributionType": "FullyReplicated"
        }
      },
      "CompressionType": "None",
      "RecordWrapperType": "None"
    }
  ]
}

Can someone please advise what is causing this and how I can get a working waiter on a training job?

rks
asked 6 years ago, 1116 views
5 Answers

Hello rks,

You received that error message because the waiter you created was expecting the training job to finish in either a Completed or Stopped state, but it reached the Failed state instead. This means there was an issue with your job and it could not be completed correctly.
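One way to see which terminal state the waiter actually hit is to catch the exception and look at the last response it recorded. The following is a minimal sketch, assuming a botocore version in which WaiterError carries a last_response attribute (the traceback in the question passes last_response=response, which suggests it does). The helper names summarize_waiter_failure and wait_for_job are hypothetical and not part of boto3.

```python
def summarize_waiter_failure(last_response):
    """Turn the waiter's last recorded response into a short message."""
    error = last_response.get("Error")
    if error:
        # The describe call itself returned a client error
        # (for example a validation error on the job name).
        return "API error {}: {}".format(error.get("Code"), error.get("Message"))
    status = last_response.get("TrainingJobStatus", "unknown")
    reason = last_response.get("FailureReason", "no reason reported")
    return "Job status {}: {}".format(status, reason)


def wait_for_job(sagemaker_client, job_name):
    """Wait for the job; if the waiter gives up, print why and re-raise."""
    # Imported lazily so the helper above works without botocore installed.
    from botocore.exceptions import WaiterError

    waiter = sagemaker_client.get_waiter("training_job_completed_or_stopped")
    try:
        waiter.wait(TrainingJobName=job_name)
    except WaiterError as err:
        # last_response is assumed to exist; fall back to an empty dict if not.
        print(summarize_waiter_failure(getattr(err, "last_response", None) or {}))
        raise
```

If the printed message is an API error rather than a job status, the waiter probably stopped because the describe request was rejected, not because the job failed.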

To understand what caused your training job to fail, you can follow these steps:

  1. Go to the Amazon SageMaker console in your account
  2. Click on "Jobs" in the navigation bar to the left
  3. Search for your training job in the list. You can use the "Search jobs" text form to quickly find the job by its name, or filter them by status.
  4. The training job details should explain why it failed.
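The same details shown in the console can also be fetched from code. This is a rough sketch that calls describe_training_job and picks out the status fields; get_failure_details is a hypothetical helper name, and the client is passed in so that the function does not depend on how credentials or regions are configured.

```python
def get_failure_details(sagemaker_client, job_name):
    """Return the status fields of a training job as a small dict."""
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": response["TrainingJobStatus"],
        # These keys may be absent, so read them defensively.
        "secondary_status": response.get("SecondaryStatus"),
        "failure_reason": response.get("FailureReason"),
    }


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# print(get_failure_details(boto3.client("sagemaker"), "some-jobname"))
```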

If you need more help with this issue, please don't hesitate to contact us.

Best regards,
Amazon SageMaker team

answered 6 years ago

Hi Rodrigo,

Thanks for looking into this. Actually, the training job does eventually succeed. I tried putting time.sleep(N) before I call get_waiter, but no matter what wait time I chose, the waiter still failed with the same exception. I waited 5 seconds, then 2 minutes, and even 3 minutes, which was longer than my training job took to complete successfully, but I always got the same exception.

So, I want to emphasize that there is nothing wrong with the training jobs I create. They succeed. But the get_waiter method always fails for me. What am I doing wrong here?
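While the waiter problem is being tracked down, one possible workaround is to poll the job status directly. The sketch below is not the boto3 waiter itself; it is a hand-written loop around describe_training_job. The name poll_training_job and the default delay and attempt values are assumptions chosen for illustration.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATES = {"Completed", "Stopped", "Failed"}


def poll_training_job(sagemaker_client, job_name, delay=30, max_attempts=120,
                      sleep=time.sleep):
    """Poll until the job reaches a terminal state.

    Returns a (status, failure_reason) tuple; failure_reason is None
    unless the service reported one.
    """
    for _ in range(max_attempts):
        response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status, response.get("FailureReason")
        sleep(delay)
    raise TimeoutError(
        "Training job {} did not finish after {} checks".format(job_name, max_attempts)
    )
```

Unlike the waiter, this loop lets any error raised by describe_training_job propagate unchanged, so the full error code and message remain visible.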

Edited by: rks on May 17, 2018 4:34 PM

rks
answered 6 years ago

Hi rks,

I'm sorry for the confusion; I misunderstood your issue. I attempted to reproduce it by running the example code myself in a SageMaker notebook, but the job ran and the waiter worked correctly. For the record, I used the low-level KMeans MNIST sample notebook that comes bundled with all SageMaker notebooks. You can find it in "/sample-notebooks/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb". That notebook should be very similar to the code you tried to run.

Could you tell us more about the environment you're running the job in? In particular, we'd like to know the versions of Python and boto3 you're using. You can run the following commands to get them:

import sys, boto3
print("boto version = " + boto3.__version__)
print("python version = " + sys.version)

Thank you for your patience.

Rodrigo

answered 6 years ago

Hi Rodrigo,

Thank you for helping me out. These are the versions of boto3 and Python I am running:

16:30 $ python3.6
Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, boto3
>>> print("boto version = " + boto3.__version__)
boto version = 1.5.18
>>> print("python version = " + sys.version)
python version = 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

I'll review my code based on the notebook you've pointed out.

Edited by: rks on May 18, 2018 4:35 PM

Edited by: rks on May 18, 2018 4:36 PM

rks
answered 6 years ago

Hi rks,

We have tried to troubleshoot this issue but it doesn't seem like it's reproducible. It's likely that the training-job name was invalid. Is this issue still occurring?
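To rule out the job name as a cause, it can be checked locally before the job is created. The sketch below assumes the naming constraint as we understand it from the CreateTrainingJob API reference (1 to 63 characters, letters, digits and hyphens only, starting and ending with a letter or digit). Please confirm the exact rule against the current API reference; is_valid_training_job_name is a hypothetical helper.

```python
import re

# Assumed constraint: alphanumerics separated by optional hyphens,
# never starting or ending with a hyphen.
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_JOB_NAME_LENGTH = 63  # assumed limit


def is_valid_training_job_name(name):
    """Return True if the name satisfies the assumed naming constraint."""
    if len(name) > MAX_JOB_NAME_LENGTH:
        return False
    return _JOB_NAME_PATTERN.fullmatch(name) is not None
```

Under this assumption a name such as some-jobname passes, while a name containing an underscore or ending in a hyphen does not.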

Thanks,
Ingrid

answered 6 years ago
