SageMaker - MaxWaitTimeInSeconds Error when running TensorFlow Job

Hi everyone, I have a problem running my training job on a spot instance. If I run it normally, everything works properly, but when I add MaxWaitTimeInSeconds, MaxRuntimeInSeconds, and use_spot_instances I get the following error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It must be present and be greater than or equal to MaxRuntimeInSeconds

This happens regardless of the values I assign to them. I tried every combination, including MaxWaitTimeInSeconds greater than or equal to MaxRuntimeInSeconds. For example, this code returns the same error:

`tf2_estimator = TensorFlow(
    entry_point="AE_idsqty_train_spot_ordinal.py",
    dependencies=["config.py", "vocabulary.xlsx", "requirements.txt"],
    source_dir=".",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    framework_version="2.4.1",
    hyperparameters=hyperparams,
    py_version="py37",
    metric_definitions=metric_definitions,
    enable_sagemaker_metrics=True,
    tags=[],

    MaxWaitTimeInSeconds=(9*60*60),  # Max waiting time for a new spot instance
    MaxRuntimeInSeconds=(3*60*60),  # Max execution time for a new spot instance
    use_spot_instances=True,  # Train using the spot instances

    checkpoint_s3_uri=f"s3://{bucket}/checkpoints"
)`

I really cannot figure out the problem. I suppose it could be a version issue or a bug. Could someone help me? Thank you.

asked a year ago · 290 views
1 Answer
In general, this exception is raised when 'train_max_wait' is set to less than 'train_max_run' in your job. 'train_max_wait' can be set only if 'train_use_spot_instances' is True, and it must be greater than or equal to 'train_max_run'.

I have reproduced the error on my end with the arguments above. For example:

=========
train_use_spot_instances = True
train_max_run = 3700
train_max_wait = 3700 if train_use_spot_instances else None
# this one worked successfully

=========
train_use_spot_instances = True
train_max_run = 3700
train_max_wait = 3600 if train_use_spot_instances else None
# this one failed with the error above, i.e. (ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It must be present and be greater than or equal to MaxRuntimeInSeconds)

============
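
The rule behind the two cases above can be sketched as a small local check. This is only an illustration of the constraint described in the error message, not the service's actual validation code; the function name `check_spot_timing` and its messages are hypothetical.

```python
from typing import Optional


def check_spot_timing(use_spot: bool, max_run: int, max_wait: Optional[int]) -> Optional[str]:
    """Return a description of the problem, or None if the settings look valid.

    Hypothetical helper: it mirrors the rule stated in the error message,
    it is not part of any AWS library.
    """
    if not use_spot:
        # The wait limit only makes sense for spot training.
        if max_wait is not None:
            return "max_wait can only be set when spot training is enabled"
        return None
    if max_wait is None:
        # Spot training needs a wait limit ("It must be present").
        return "max_wait must be present for spot training"
    if max_wait < max_run:
        return "max_wait must be greater than or equal to max_run"
    return None


# The two cases from the reproduction above.
ok_case = check_spot_timing(True, 3700, 3700)    # valid: wait equals run
bad_case = check_spot_timing(True, 3700, 3600)   # invalid: wait is shorter than run
```

Note that a missing wait limit fails the same rule as one that is too small, which is why the error says "must be present and be greater than or equal to".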

Hence, to mitigate this error, please confirm that train_max_wait is greater than or equal to train_max_run in your job config. For more information, please refer to the sample notebook: https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/managed_spot_training_object_detection/managed_spot_training_object_detection.ipynb

Refer to the docs: https://aws.amazon.com/getting-started/hands-on/managed-spot-training-sagemaker/
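
One more thing worth checking, as an assumption about the setup rather than a confirmed diagnosis: in version 2.x of the SageMaker Python SDK the estimator keywords are `max_run`, `max_wait` and `use_spot_instances`. The `train_`-prefixed names come from SDK 1.x, and `MaxWaitTimeInSeconds` / `MaxRuntimeInSeconds` are field names of the CreateTrainingJob API request, not estimator arguments. If the estimator does not recognise a keyword, the wait limit may never reach the request, which would match the "must be present" part of the error. A minimal sketch of building the spot-related keywords under that assumption (the helper `spot_estimator_kwargs` and the bucket name are hypothetical):

```python
def spot_estimator_kwargs(max_run_seconds: int, max_wait_seconds: int, checkpoint_s3_uri: str) -> dict:
    """Build spot-training keyword arguments using the SDK 2.x parameter names.

    Hypothetical helper for illustration; it only assembles a dict and does
    not call AWS.
    """
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,        # train on spot instances
        "max_run": max_run_seconds,        # maximum training time in seconds
        "max_wait": max_wait_seconds,      # maximum total time, including waiting for spot capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


# Same values as in the question: 3 hours of run time, 9 hours of wait time.
spot_kwargs = spot_estimator_kwargs(3 * 60 * 60, 9 * 60 * 60, "s3://example-bucket/checkpoints")

# Assumed usage (needs the sagemaker package and AWS credentials):
# pass the remaining estimator arguments as before and add **spot_kwargs
# to the TensorFlow(...) call.
```

Keeping the spot settings in one dict makes it easy to switch between spot and on-demand runs without touching the rest of the estimator arguments.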

AWS
answered a year ago
  • I tried to follow them and used the same values as you, but the same error appears. I tried updating SageMaker in the first cell of the notebook, but it does not seem to have any effect. I work with sagemaker.version = 2.168.0