How to log SageMaker metrics beyond the 40-metric constraint?


How can we log metrics for a training job beyond the 40-metric limit?

I tried to create a training job for an NER (Named Entity Recognition) task. As I have many classes (e.g. PER, ORG, LOC, etc.), I would like to log the associated metrics for each of them. However, when I ran the job, it threw the following error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value '[MetricDefinition(...), ...]' at 'algorithmSpecification.metricDefinitions' failed to satisfy constraint: Member must have length less than or equal to 40
asked 3 months ago · 18 views
1 Answer
Accepted Answer

This is consistent with the CreateTrainingJob API documentation, and to my knowledge it's a hard (non-adjustable) limit. If you have a strong requirement, though, it may be worth raising a support case to double-check whether an increase is possible.

You could consider logging additional metrics to CloudWatch directly from your training script via the CloudWatch APIs / boto3. I'd expect some limitations in where those metrics are visible (e.g. whether they show on the training job details page in the SageMaker console, or in the Experiments & Trials view in SageMaker Studio) - but if you were able to get them logged under the same /aws/sagemaker/TrainingJobs/{TrainingJobName} namespace as the auto-collected metrics, they might be reflected there. Your script can determine the current training job name from the TRAINING_JOB_NAME environment variable if you want to try this.
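As a sketch of that approach - assuming boto3 is available in the training container, and noting that whether a custom namespace actually surfaces alongside the auto-collected metrics is unverified - something like:

```python
import os


def build_metric_datum(name, value, job_name):
    """Build one CloudWatch PutMetricData datum for a custom training metric."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "TrainingJobName", "Value": job_name}],
        "Value": float(value),
        "Unit": "None",
    }


def log_metric(name, value):
    """Publish a single metric value to CloudWatch (needs AWS credentials)."""
    import boto3  # deferred import so the pure helper above works offline

    job_name = os.environ.get("TRAINING_JOB_NAME", "local-test")
    boto3.client("cloudwatch").put_metric_data(
        # Assumption: reusing the namespace of the auto-collected metrics;
        # a dedicated custom namespace would also work, just elsewhere in CW.
        Namespace="/aws/sagemaker/TrainingJobs",
        MetricData=[build_metric_datum(name, value, job_name)],
    )
```

You could then call `log_metric("f1_PER", 0.91)` per class from the evaluation loop, without ever touching the training job's MetricDefinitions list.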

Be aware that metric-data API calls, while usually fast, still add some latency: ideally you'd make them asynchronously to avoid slowing down your training job.
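One minimal way to do that (a generic sketch, not SageMaker-specific) is a queue plus a background worker thread; `publish` here is any callable you supply, e.g. a function wrapping `put_metric_data`:

```python
import queue
import threading


class AsyncMetricLogger:
    """Buffer metric writes and publish them from a background thread,
    so the training loop never blocks on the network call."""

    def __init__(self, publish):
        self._publish = publish  # callable taking (name, value)
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, name, value):
        self._q.put((name, value))  # returns immediately

    def _drain(self):
        while True:
            item = self._q.get()
            if item is None:  # sentinel: shut down
                break
            self._publish(*item)

    def close(self):
        """Flush remaining metrics and stop the worker."""
        self._q.put(None)
        self._worker.join()
```

Calling `close()` at the end of training flushes anything still queued before the job exits.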

answered 3 months ago
  • This sounds like a significant limitation for experiment tracking. Will have to find another workaround.

    I have actually separately tried to manually log metrics within a SageMaker training job, but encountered:

    <stderr>:WARNING:root:Cannot write metrics in this environment.

    The code is:

    from smexperiments.tracker import Tracker  # sagemaker-experiments package

    metrics = dict(a=1, b=2)
    with Tracker.load(training_job_name=sm_job_name) as tracker:
        for metric, value in metrics.items():
            tracker.log_metric(metric, value)

    Of course, there are a couple of other factors, e.g. (1) I was running distributed training, and (2) I was using the Hugging Face Trainer API.
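On factor (1): in distributed training, every process may attempt to write the same metrics, so it's common to restrict logging to the main process. A minimal guard, assuming a torch.distributed-style launcher that sets the standard RANK environment variable (the helper names here are illustrative):

```python
import os


def is_main_process():
    """Only rank 0 should emit metrics, to avoid duplicate writes.
    RANK is set by torch.distributed launchers; default to 0 so
    single-process runs still log."""
    return int(os.environ.get("RANK", "0")) == 0


def log_metrics(metrics, log_fn):
    """Call log_fn(name, value) for each metric, but only on rank 0."""
    if is_main_process():
        for name, value in metrics.items():
            log_fn(name, value)
```

`log_fn` can be whatever logger you settle on (a Tracker, a CloudWatch wrapper, etc.); the guard is independent of it.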
