This is consistent with the CreateTrainingJob API doc, and to my knowledge it's a hard (non-adjustable) limit. However, if you have a strong requirement, it may be worth raising a support case to double-check whether an increase is possible.
You could consider logging additional metrics to CloudWatch directly from your training script via the CloudWatch APIs / boto3 if needed. I expect there would be some limitations on where those metrics are visible (e.g. whether they show on the training job details page in the SageMaker console, or in the Experiments & Trials view in SageMaker Studio) - but if you were able to get them logged under the same /aws/sagemaker/TrainingJobs/{TrainingJobName}
namespace as the auto-collected metrics, they might be reflected there. Your script can determine the current training job name from the TRAINING_JOB_NAME environment variable if you want to try this.
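A minimal sketch of what that could look like, assuming boto3 is available in the container and the job's execution role has `cloudwatch:PutMetricData` permission. The `/aws/sagemaker/TrainingJobs` namespace and the `TrainingJobName` dimension are my assumptions about how the auto-collected metrics are keyed, not something I have verified:

```python
import os


def build_metric_data(metric_name, value, unit="None"):
    # Dimension the metric by training job name so it can (hopefully) sit
    # alongside the auto-collected metrics. TRAINING_JOB_NAME is set by
    # SageMaker inside the training container; the fallback is for local runs.
    job_name = os.environ.get("TRAINING_JOB_NAME", "local-test")
    return [{
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TrainingJobName", "Value": job_name}],
        "Value": float(value),
        "Unit": unit,
    }]


def log_metric(metric_name, value, unit="None"):
    # boto3 is imported lazily so the payload builder above stays usable
    # (and testable) on machines without AWS credentials configured.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="/aws/sagemaker/TrainingJobs",  # assumption - may need adjusting
        MetricData=build_metric_data(metric_name, value, unit),
    )
```

You would call `log_metric("custom_eval_loss", loss)` from the training loop wherever the built-in regex-based metric definitions don't reach.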
Be aware that, while relatively fast, metric data API calls still add some latency: in an ideal world you would make them asynchronously to avoid slowing down your training job.
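One hedged sketch of that asynchronous pattern: a background thread drains a queue and publishes, so the training loop never blocks on the API call. `publish_fn` is a placeholder for whatever actually calls `put_metric_data`:

```python
import queue
import threading


class AsyncMetricPublisher:
    """Publishes metrics from a background thread so API latency never
    blocks the training loop. A sketch only: in practice `publish_fn`
    would wrap a boto3 `put_metric_data` call, ideally batching items."""

    def __init__(self, publish_fn):
        self._queue = queue.Queue()
        self._publish_fn = publish_fn
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, metric_name, value):
        # Non-blocking from the training loop's perspective.
        self._queue.put((metric_name, value))

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel -> shut down
                break
            self._publish_fn(*item)

    def close(self):
        # Flush remaining items, then stop the worker.
        self._queue.put(None)
        self._worker.join()
```

The daemon thread plus the `None` sentinel means a crashed training process won't hang on exit, while a clean `close()` still flushes everything that was queued.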
This sounds like a significant limitation for experiment tracking... Will have to find another workaround...
I have actually separately tried to manually log metrics within a SageMaker training job, but encountered
code is
Of course, there are a couple of other factors, e.g. (1) I was running distributed training, and (2) I was using the Hugging Face Trainer API.