Training Metric logging on SageMaker experiment tracking: how to get time-series metrics with visualisation


I am using the SageMaker Python SDK to train a bespoke model. I have defined my metric_definitions regexes and passed them to the estimator like this:

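from sagemaker.estimator import Estimator

# Capture the whole number (mantissa plus optional exponent such as e-05) in the
# FIRST group. Assumption: SageMaker reads the metric value from group 1, so a
# separate exponent group would be dropped. (The original "[[01]" was also a typo.)
num_re = "([0-9\\.]+([eE][-+]?[0-9]+)?)"
metrics = [
    {"Name": "learning-rate", "Regex": f"lr: {num_re}"},
    {"Name": "training:loss", "Regex": f"loss: {num_re}"},
    # ...
]
estimator = Estimator(
    image_uri=training_image_uri,
    # ...
    metric_definitions=metrics,
    enable_sagemaker_metrics=True,
)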
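The learning-rate summary further down (values between 8.889 and 8.907) looks like mantissas with the exponent missing. A likely cause, assuming SageMaker takes the metric value from the first capture group, is that a pattern like `([0-9\\.]+)(e-?[[01][0-9])?` puts the exponent in a second group that is discarded. This sketch uses Python's `re` on a hypothetical log line to compare that pattern with one whose first group spans the whole number; `first_group` and the log line are illustrative only.

```python
import re

# Mantissa and exponent in separate groups (the exponent lands in group 2).
split_re = "([0-9\\.]+)(e-?[[01][0-9])?"
# Outer group 1 spans the whole number; the inner exponent group is nested in it.
whole_re = r"([0-9.]+([eE][-+]?[0-9]+)?)"

line = "step: 100 lr: 8.907e-05 loss: 0.4321"  # hypothetical log line


def first_group(num_pattern, label, text):
    """Return group 1 of `<label>: <number>` as a float, or None if absent."""
    m = re.search(f"{label}: {num_pattern}", text)
    return float(m.group(1)) if m else None


print(first_group(split_re, "lr", line))  # 8.907 (exponent lost)
print(first_group(whole_re, "lr", line))  # 8.907e-05
```

Nesting the exponent group inside the outer group (rather than using `(?:...)`) keeps group 1 as the full value without relying on non-capturing-group support in whatever regex engine SageMaker uses.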

When I run training, these metrics are visible in my logs and I can also see them in SageMaker Studio in Trial Components > Metrics (tab) as a grid of numbers like:

Name | Minimum | Maximum | Standard Deviation | Average | Count | Final value
learning-rate | 8.889 | 8.907 | 0.010392304845413657 | 8.898 | 4 | 8.907
...

This suggests that the regexes are matching the logs correctly.

However, I am not able to visualise any graphs for my metrics. I have tried all of:

  • SageMaker Studio > Trial components > Charts: it is only possible to plot summary values such as learning-rate_min (i.e. a single point value, not a time series).
  • SageMaker AWS console > Training > Training jobs > <select job> > scroll to the Monitor section: here I can see metrics like CPUUtilization over time, but for each metric I have defined there is just an empty graph that says 'No data available'.
  • SageMaker AWS console > Training > Training jobs > <select job> > Monitor section > View algorithm metrics (opens in CloudWatch) > Browse > select a metric (e.g. learning-rate) and click 'Add to graph'. I filter to the correct time period and go to the Graphed metrics (1) tab, but even after setting the period to 1 second nothing appears on the graph.

I'm not sure what the issue is here, so any help would be much appreciated.
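One way to get the time series as data instead of a console graph is `sagemaker.analytics.TrainingJobAnalytics` from the SageMaker Python SDK, which reads the job's metrics from CloudWatch and returns a pandas DataFrame. As far as I know, the values are CloudWatch averages per `period` (60 s by default), so they are coarser than per-iteration log lines, and if the console graphs say 'No data available' this may come back empty too. The helper names below are hypothetical, and plotting uses matplotlib; this is a sketch, not a tested recipe.

```python
def fetch_metric_frame(training_job_name, metric_names):
    """Pull a training job's metrics from CloudWatch as a long-format DataFrame.

    Needs the `sagemaker` package and AWS credentials, so it is imported lazily.
    """
    from sagemaker.analytics import TrainingJobAnalytics

    analytics = TrainingJobAnalytics(
        training_job_name=training_job_name,
        metric_names=metric_names,
    )
    # Expected columns: timestamp, metric_name, value.
    return analytics.dataframe()


def split_by_metric(rows):
    """Group (timestamp, metric_name, value) rows into {name: ([t, ...], [v, ...])}."""
    series = {}
    for timestamp, name, value in sorted(rows):
        times, values = series.setdefault(name, ([], []))
        times.append(timestamp)
        values.append(value)
    return series


def plot_series(series):
    """Draw one line chart per metric, stacked vertically."""
    import matplotlib.pyplot as plt

    if not series:
        raise ValueError("no metric data points to plot")
    fig, axes = plt.subplots(len(series), 1, squeeze=False)
    for ax, (name, (times, values)) in zip(axes[:, 0], series.items()):
        ax.plot(times, values, marker="o")
        ax.set_title(name)
        ax.set_xlabel("timestamp")
    fig.tight_layout()
    return fig


# Usage ("my-training-job" is a placeholder):
# df = fetch_metric_frame("my-training-job", ["learning-rate", "training:loss"])
# rows = df[["timestamp", "metric_name", "value"]].itertuples(index=False, name=None)
# plot_series(split_by_metric(rows)).savefig("metrics.png")
```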

2 Answers

Hi - this is the OP:

Thanks for your response. I should have stated this in the original question: I am logging these metrics to the console every iteration and can see them under View logs in the console. The issue is that I'm not able to:

  1. view the parsed metrics for the period (I can only see the mean/max/min/...)
  2. get visualisations of these metrics
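To see every parsed value rather than only the min/max/average summary, one workaround is to read the job's log events with boto3 and apply the same regexes locally. This sketch assumes the job writes to the `/aws/sagemaker/TrainingJobs` log group with stream names beginning `<job-name>/`; it approximates SageMaker's parsing (at most one match per metric per log line) and is not its actual implementation. `METRIC_DEFINITIONS`, `parse_events`, and `fetch_log_events` are hypothetical names.

```python
import re

# Hypothetical copy of the metric definitions, with group 1 spanning the whole number.
NUM = r"([0-9.]+([eE][-+]?[0-9]+)?)"
METRIC_DEFINITIONS = [
    {"Name": "learning-rate", "Regex": f"lr: {NUM}"},
    {"Name": "training:loss", "Regex": f"loss: {NUM}"},
]


def parse_events(events, metric_definitions):
    """Apply each metric regex to each log event.

    `events` are dicts with 'timestamp' (ms since epoch) and 'message',
    the shape CloudWatch Logs returns. Result: {name: [(timestamp_ms, value), ...]}.
    """
    compiled = [(d["Name"], re.compile(d["Regex"])) for d in metric_definitions]
    out = {name: [] for name, _ in compiled}
    for event in events:
        for name, pattern in compiled:
            m = pattern.search(event["message"])
            if m:
                out[name].append((event["timestamp"], float(m.group(1))))
    return out


def fetch_log_events(training_job_name):
    """Yield the job's log events (needs boto3 and AWS credentials)."""
    import boto3

    paginator = boto3.client("logs").get_paginator("filter_log_events")
    for page in paginator.paginate(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamNamePrefix=f"{training_job_name}/",
    ):
        yield from page["events"]


# Usage ("my-training-job" is a placeholder):
# parsed = parse_events(fetch_log_events("my-training-job"), METRIC_DEFINITIONS)
```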
Answered 2 years ago
  • Were you able to find a way to visualize experiment metrics? I just started using SageMaker Experiments and can't find a way to visualize metrics.


Hello, thank you for contacting us.

I understand you are not able to visualize any graphs for your metrics, even though you see them in "Trial Components > Metrics (tab)".

SageMaker parses the CloudWatch logs for your training job and emits metrics from the parsed lines as defined in metric_definitions. What appears in those logs depends on your train.py script: for example, if you want metrics per step (or per 100 steps), your script needs to print the metrics every step (or every 100 steps) so that they appear in the CloudWatch logs. Please see the documentation linked below for more information.
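On the training-script side, a minimal sketch, assuming the `lr: <number>` / `loss: <number>` format matched by the regexes in the question, could look like this. `LOG_EVERY`, `format_metrics`, and `maybe_log` are hypothetical names. Formatting the learning rate with `:.6g` keeps the exponent in small values such as 8.907e-05, and `flush=True` helps each line reach the log stream promptly instead of waiting in Python's stdout buffer.

```python
LOG_EVERY = 100  # hypothetical: print metrics every 100 steps


def format_metrics(step, lr, loss):
    """Build one log line in the `name: value` shape the metric regexes match."""
    # .6g switches to scientific notation for small values, e.g. 8.907e-05
    return f"step: {step} lr: {lr:.6g} loss: {loss:.6f}"


def maybe_log(step, lr, loss):
    """Print (and return) the metrics line every LOG_EVERY steps, else return None."""
    if step % LOG_EVERY != 0:
        return None
    line = format_metrics(step, lr, loss)
    # Flush so the line is not held in the stdout buffer before reaching the logs.
    print(line, flush=True)
    return line
```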

Examples [1] and [2] in the official AWS GitHub repository include an entry script that emits custom metrics for a hyperparameter tuning job. Please compare them with your script.
[1] https://github.com/aws/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/tensorflow2_mnist
[2] https://github.com/aws/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/tensorflow_mnist

If you still have difficulties, I recommend opening a support case and providing more details about your account and your script/configuration. For security reasons, we cannot discuss account-specific issues in public posts.

Thank you.

Answered 2 years ago
