Hello rks,
You received that error message because the waiter you created was expecting the training job to finish in either a Completed or Stopped state, but it reached the Failed state instead. This means there was an issue with your job and it could not be completed correctly.
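The waiter behavior described above can be approximated with a manual polling loop, which is also a useful cross-check when the waiter and the console seem to disagree. This is only a sketch: `wait_for_training_job` and its `delay`, `max_attempts`, and `sleep` parameters are hypothetical names chosen for illustration, and `client` is assumed to be a boto3 SageMaker client (`boto3.client("sagemaker")`).

```python
import time

# Terminal states, following the description above: Completed or Stopped
# counts as success, Failed counts as failure.
SUCCESS_STATES = {"Completed", "Stopped"}
FAILURE_STATES = {"Failed"}


def wait_for_training_job(client, job_name, delay=30, max_attempts=120,
                          sleep=time.sleep):
    """Poll DescribeTrainingJob until the job reaches a terminal state.

    `client` is assumed to be boto3.client("sagemaker"). The delay and
    attempt count are illustrative defaults, not the values boto3 uses.
    Returns the final status, or raises RuntimeError on failure or timeout.
    """
    for _ in range(max_attempts):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in SUCCESS_STATES:
            return status
        if status in FAILURE_STATES:
            reason = response.get("FailureReason", "no reason reported")
            raise RuntimeError(
                "Training job %s reached state Failed: %s" % (job_name, reason)
            )
        # Still InProgress or Stopping: wait before asking again.
        sleep(delay)
    raise RuntimeError(
        "Training job %s did not finish after %d attempts"
        % (job_name, max_attempts)
    )

# Example use (requires AWS credentials and a real job name):
#   import boto3
#   wait_for_training_job(boto3.client("sagemaker"), "my-training-job")
```

If this loop returns "Completed" while the waiter still raises, the problem is more likely in how the waiter is called than in the job itself.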
To understand what caused your training job to fail, you can follow these steps:
- Go to the Amazon SageMaker console in your account
- Click on "Jobs" in the navigation bar to the left
- Search for your training job in the list. You can use the "Search jobs" text field to find the job by name, or filter the list by status.
- The training job details should explain why it failed.
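The same information shown in the console steps above can also be read programmatically. The sketch below assumes `client` is a boto3 SageMaker client (`boto3.client("sagemaker")`); the helper name `explain_training_job` is hypothetical. It relies on the `TrainingJobStatus` and `FailureReason` fields of the `describe_training_job` response, where `FailureReason` is only expected to be present when the job has failed.

```python
def explain_training_job(client, job_name):
    """Return a one-line summary of a training job's status.

    `client` is assumed to be boto3.client("sagemaker").
    """
    response = client.describe_training_job(TrainingJobName=job_name)
    status = response["TrainingJobStatus"]
    if status == "Failed":
        # FailureReason may be absent, so fall back to a placeholder.
        reason = response.get("FailureReason") or "no failure reason reported"
        return "%s: Failed (%s)" % (job_name, reason)
    return "%s: %s" % (job_name, status)

# Example use (requires AWS credentials and a real job name):
#   import boto3
#   print(explain_training_job(boto3.client("sagemaker"), "my-training-job"))
```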
If you need more help with this issue, please don't hesitate to contact us.
Best regards,
Amazon SageMaker team
Hi Rodrigo,
Thanks for looking into this. Actually, the training job does eventually succeed. I tried calling time.sleep(N) before I call get_waiter, but no matter what wait time I chose, the waiter still failed with the same exception. I waited 5 seconds, I waited 2 minutes, I even waited 3 minutes, which was more than the time my training job took to complete successfully, but I always got the same exception.
So, I want to emphasize that there is nothing wrong with the training jobs I create. They succeed. But the get_waiter method always fails for me. What am I doing wrong here?
Edited by: rks on May 17, 2018 4:34 PM
Hi rks,
I'm sorry for the confusion; I misunderstood your issue. I tried to reproduce it by running the example code myself in a SageMaker notebook, but the job ran and the waiter worked correctly. For the record, I used the low-level KMeans MNIST sample notebook that comes bundled with all SageMaker notebooks. You can find it in "/sample-notebooks/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb". That notebook should be very similar to the code you tried to run.
Could you tell us more about the environment you're running the job in? In particular, we'd like to know the versions of Python and boto3 you're using. You can run the following commands to get them:
import sys, boto3
print("boto version = " + boto3.__version__)
print("python version = " + sys.version)
Thank you for your patience.
Rodrigo
Hi Rodrigo,
Thank you for helping me out. This is the version of boto and python I am running:
16:30 $ python3.6
Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, boto3
>>> print("boto version = " + boto3.__version__)
boto version = 1.5.18
>>> print("python version = " + sys.version)
python version = 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
I'll review my code based on the notebook you've pointed out.
Edited by: rks on May 18, 2018 4:35 PM
Edited by: rks on May 18, 2018 4:36 PM
Hi rks,
We have tried to troubleshoot this issue, but we could not reproduce it. It's likely that the training-job name was invalid. Is this issue still occurring?
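To rule out an invalid name before calling the waiter, a minimal local check could look like the sketch below. The pattern is an assumption taken from the constraint documented for `TrainingJobName` in the CreateTrainingJob API (1 to 63 characters, alphanumerics and hyphens, starting and ending with an alphanumeric); please confirm it against the current API reference. The helper name `is_valid_training_job_name` is hypothetical.

```python
import re

# Assumed constraint for TrainingJobName: alphanumeric characters and
# hyphens, 1 to 63 characters, must not start or end with a hyphen.
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_training_job_name(name):
    """Return True if `name` matches the assumed TrainingJobName pattern."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None

# Example use:
#   is_valid_training_job_name("kmeans-2018-05-17")  # hyphens are allowed
#   is_valid_training_job_name("kmeans_2018_05_17")  # underscores are not
```

A name that fails this check would also be rejected by the service, which the waiter reports as a failure even though a job with a similar, valid name may have completed.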
Thanks,
Ingrid