This is consistent with the CreateTrainingJob API doc, and to my knowledge it's a hard (non-adjustable) limit. However, if you have a strong requirement, it may be worth raising a support case to double-check whether an increase is possible.
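For context, the metric definitions in question are the ones you attach when creating the job, e.g. via the SageMaker Python SDK. The image URI, role, and regexes below are placeholders; this is just a rough sketch of where the per-job limit applies:

```python
from sagemaker.estimator import Estimator

# Minimal sketch: the cap discussed above applies to the number of entries
# in this metric_definitions list (placeholder values throughout).
estimator = Estimator(
    image_uri="<your-training-image-uri>",       # placeholder
    role="<your-sagemaker-execution-role-arn>",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=[
        {"Name": "train:loss", "Regex": "loss=([0-9\\.]+)"},
        {"Name": "validation:accuracy", "Regex": "val_acc=([0-9\\.]+)"},
    ],
)
# estimator.fit(...) would then start the job with these definitions.
```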
You could consider logging additional metrics to CloudWatch directly from your training script via the CloudWatch APIs / boto3 if needed. I'd expect some limitations in where those metrics are visible (e.g. whether they show on the training job details page in the SageMaker console, or in the Experiments & Trials view in SageMaker Studio) - but if you were able to get them logged under the same /aws/sagemaker/TrainingJobs namespace as the auto-collected metrics (with the training job name as a dimension), they might show up alongside them. Your script should be able to determine the current training job name from the TRAINING_JOB_NAME environment variable if you want to try this.
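For reference, here's a rough (untested) sketch of what that could look like with boto3's put_metric_data. The namespace and dimension name are assumptions - you'd want to verify them against how the auto-collected metrics actually appear in your account:

```python
import os
import boto3

cloudwatch = boto3.client("cloudwatch")

# SageMaker sets TRAINING_JOB_NAME inside the training container.
job_name = os.environ.get("TRAINING_JOB_NAME", "local-test")

def log_metric(name, value, unit="None"):
    """Publish one custom metric data point for this training job."""
    cloudwatch.put_metric_data(
        # Assumed to match the namespace of the auto-collected metrics.
        Namespace="/aws/sagemaker/TrainingJobs",
        MetricData=[
            {
                "MetricName": name,
                "Dimensions": [{"Name": "TrainingJobName", "Value": job_name}],
                "Value": value,
                "Unit": unit,
            }
        ],
    )

# Example usage at the end of a validation pass:
# log_metric("validation:custom_f1", 0.87)
```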
Be aware that, although metric data API calls are fairly fast, they still add some latency: in an ideal world you'd make them asynchronously to avoid slowing down your training job.
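One (hypothetical) way to do that from Python would be to hand the call off to a single-worker thread pool, reusing the log_metric() helper sketched above:

```python
from concurrent.futures import ThreadPoolExecutor

# Background executor so the training loop doesn't block on CloudWatch calls.
_metric_executor = ThreadPoolExecutor(max_workers=1)

def log_metric_async(name, value, unit="None"):
    # Fire-and-forget: the future is discarded, so failures are silent unless
    # you keep and inspect it. log_metric() is the helper from the sketch above.
    _metric_executor.submit(log_metric, name, value, unit)

# e.g. inside the training loop:
# log_metric_async("train:loss", float(loss))
```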
This sounds like a significant limitation for experiment tracking. Will have to find another workaround.
I have actually separately tried to manually log metrics within a SageMaker training job, but encountered an issue. The code is:
Of course, there are a couple of other factors, e.g. (1) I was running distributed training, (2) I was using the Hugging Face Trainer API.