How to log SageMaker metrics beyond the 40-metric constraint?


How can we log metrics for a training job beyond the 40 metrics limit?

I tried to create a training job for an NER (Named Entity Recognition) task. As I have a lot of classes (e.g. PER, ORG, LOC, etc.), I would like to log the associated metrics for each of them. However, when I ran estimator.fit(), it threw the following error.

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value '[MetricDefinition(...), ...]' at 'algorithmSpecification.metricDefinitions' failed to satisfy constraint: Member must have length less than or equal to 40
asked 2 years ago, 206 views
1 Answer
Accepted Answer

This is consistent with the CreateTrainingJob API doc, and to my knowledge it's a hard (non-adjustable) limit. However, if you have a strong requirement, it may be worth raising a support case to double-check whether an increase is possible.

You could consider logging additional metrics to CloudWatch directly from your training script via the CloudWatch APIs / boto3 if needed. I expect there would be some limitations on where those metrics are visible (for example, whether they show on the training job details page in the SageMaker console, or in the Experiments & Trials view in SageMaker Studio), but if you were able to get them logged under the same /aws/sagemaker/TrainingJobs/{TrainingJobName} namespace as the auto-collected metrics, they might be reflected there. If you want to try this, your script should be able to determine the current training job name from the TRAINING_JOB_NAME environment variable.
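As a rough illustration of the idea above, here is a minimal sketch that builds CloudWatch metric entries and sends them with boto3's put_metric_data. The namespace value, the TrainingJobName dimension, the batch size of 20, and the helper names are all assumptions for illustration. I have not verified that metrics written this way appear in the SageMaker console, and the training job's execution role would need the cloudwatch:PutMetricData permission.

```python
import os
from datetime import datetime, timezone

# Assumption: namespace and dimension name are guesses at what the
# auto-collected training metrics use; adjust them if nothing shows up.
NAMESPACE = "/aws/sagemaker/TrainingJobs"
BATCH_SIZE = 20  # conservative number of metrics per PutMetricData request


def build_metric_data(metrics, job_name, timestamp=None):
    """Convert a {name: value} dict into CloudWatch MetricData entries."""
    timestamp = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "TrainingJobName", "Value": job_name}],
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "None",
        }
        for name, value in metrics.items()
    ]


def publish_metrics(client, metrics, job_name, namespace=NAMESPACE):
    """Send metrics in chunks; `client` is a boto3 CloudWatch client."""
    data = build_metric_data(metrics, job_name)
    for start in range(0, len(data), BATCH_SIZE):
        client.put_metric_data(
            Namespace=namespace,
            MetricData=data[start:start + BATCH_SIZE],
        )


def log_from_training_script(metrics):
    """Hypothetical entry point to call from inside the training job."""
    import boto3  # imported lazily so the helpers above work without AWS access

    job_name = os.environ["TRAINING_JOB_NAME"]
    publish_metrics(boto3.client("cloudwatch"), metrics, job_name)
```

From the training script you would then call something like log_from_training_script({"f1_PER": 0.91, "f1_ORG": 0.87}) after each evaluation step (the metric names here are made up). Passing the client in as an argument keeps the chunking logic testable without AWS credentials.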

Be aware that metric data API calls, while fast, still take some time. Ideally you would make them asynchronously to avoid slowing down your training job.
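One way to keep those calls off the training loop's critical path is a background worker thread fed by a queue. The sketch below is only an illustration: AsyncMetricPublisher is a hypothetical helper, and `send` stands for whatever function actually performs the CloudWatch call (for example, a wrapper around boto3's put_metric_data).

```python
import queue
import threading


class AsyncMetricPublisher:
    """Hypothetical helper: hands metrics to a worker thread so the
    training loop does not wait on network calls."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, send):
        self._send = send  # callable taking one {name: value} dict
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, metrics):
        """Enqueue a copy of the metrics and return immediately."""
        self._queue.put(dict(metrics))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            try:
                self._send(item)
            except Exception as exc:
                # A failed metric upload should not kill the training job.
                print(f"metric publish failed: {exc}")

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()
```

You would create one publisher at the start of training, call log(...) after each evaluation, and call close() at the end so that queued metrics are sent before the container exits.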

AWS
EXPERT
Alex_T
answered 2 years ago
  • This sounds like a significant limitation for experiment tracking. I will have to find another workaround.

    I have also separately tried to log metrics manually within a SageMaker training job, but encountered this warning:

    <stderr>:WARNING:root:Cannot write metrics in this environment.
    

    The code is:

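    from smexperiments.tracker import Tracker  # sagemaker-experiments package

    # sm_job_name holds the name of the current training job (defined elsewhere)
    metrics = dict(a=1, b=2)
    with Tracker.load(training_job_name=sm_job_name) as tracker:
        for metric, value in metrics.items():
            tracker.log_metric(metric, value)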
    

    Of course, there are a couple of other factors: (1) I was running distributed training, and (2) I was using the Hugging Face Trainer API.
