Training Metric logging on SageMaker experiment tracking: how to get time-series metrics with visualisation


I am using the SageMaker Python SDK to train a bespoke model. I have defined my metric_definitions regexes and passed them to the estimator like this:

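from sagemaker.estimator import Estimator

# Matches numbers such as 0.01 or 8.889e-05 (mantissa plus optional exponent)
num_re = "([0-9\\.]+)(e-?[01][0-9])?"
metrics = [
    {"Name": "learning-rate", "Regex": f"lr: {num_re}"},
    {"Name": "training:loss", "Regex": f"loss: {num_re}"},
    # ...
]
estimator = Estimator(
    # ...
    metric_definitions=metrics,
)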

When I run training, these metrics are visible in my logs and I can also see them in SageMaker Studio in Trial Components > Metrics (tab) as a grid of numbers like:

Name          | Minimum | Maximum | Standard Deviation   | Average | Count | Final value
learning-rate | 8.889   | 8.907   | 0.010392304845413657 | 8.898   | 4     | 8.907


This suggests that the regexes are correctly matching the logs.

However, I am not able to visualise any graphs for my metrics. I have tried all of:

  • SageMaker Studio > Trial components > Charts. It is only possible to plot things like learning-rate_min (i.e. a point value, not a time-series metric).
  • SageMaker AWS console > Training > Training jobs > <select job> > scroll to the Monitor section. Here I can see metrics like CPUUtilization over time, but for each metric that I have defined there is just an empty graph that says 'No data available'.
  • SageMaker AWS console > Training > Training jobs > <select job> > scroll to the Monitor section > View algorithm metrics (opens in CloudWatch) > Browse > select a metric (e.g. learning-rate) and 'Add to Graph'. I filter by the correct time period and go to the Graphed metrics (1) tab. Even after updating the period to 1 second, I am not able to see anything on the graph.

I'm not sure what the issue is here, but any help would be much appreciated.

2 Answers

Hello, Thank you for contacting us.

I understand you are not able to visualize any graphs for your metrics, even though you see them in "Trial Components > Metrics (tab)".

SageMaker parses the CloudWatch logs for your training job and emits metrics from the parsed logs as defined in metric_definitions. The contents of the CloudWatch logs for your training job depend on your script. For example, if you wish to have metrics per step (or per 100 steps), your script needs to print the metrics per step (or per 100 steps) so that they appear in the CloudWatch logs. Please see the documentation linked below for more information.
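To illustrate the point about printing metrics every step, here is a minimal local sketch that builds one log line per step and applies the regexes from the question to it. It only emulates the matching with Python's `re` module and is not SageMaker's actual parser. The log line format and the helper names `format_step_log` and `extract_metrics` are hypothetical, and the stray `[` in the original pattern is assumed to be a typo.

```python
import re

# Pattern from the question, with the stray "[" removed (assumed typo).
NUM_RE = r"([0-9\.]+)(e-?[01][0-9])?"
METRIC_DEFINITIONS = [
    {"Name": "learning-rate", "Regex": f"lr: {NUM_RE}"},
    {"Name": "training:loss", "Regex": f"loss: {NUM_RE}"},
]


def format_step_log(step, lr, loss):
    """Build the line a training script would print each step (hypothetical format)."""
    return f"step: {step} lr: {lr:.3e} loss: {loss:.4f}"


def extract_metrics(log_lines, definitions):
    """Apply each regex to each log line and collect the values in log order."""
    series = {d["Name"]: [] for d in definitions}
    for line in log_lines:
        for d in definitions:
            match = re.search(d["Regex"], line)
            if match:
                # group(1) is only the mantissa (e.g. "8.889" for 8.889e-05);
                # group(2) holds the optional exponent, so join both.
                text = match.group(1) + (match.group(2) or "")
                series[d["Name"]].append(float(text))
    return series
```

Running `extract_metrics` over a few lines produced by `format_step_log` gives one value per metric per step, which is the shape the logs need for a per-step series. Note that the first capture group on its own holds only the mantissa of a number written in scientific notation.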

I am also providing examples from our official AWS GitHub repository in [1] and [2], each of which includes an entry script that emits custom metrics for a hyperparameter tuning job. Please compare them with your script. [1] [2]

If you still have difficulties, I recommend opening a support case and providing more detail about your account information and script/config. For security reasons, we cannot discuss account-specific issues in public posts.

Thank you.

answered 6 months ago

Hi - this is the OP:

Thanks for your response. Yes, I should have stated this in the original question: I am logging these metrics to the console every iteration and can see them under View logs in the console. The issue is that I'm not able to:

  1. view the parsed metrics for the period (I can only see the mean/max/min/...)
  2. get visualisations of these metrics
answered 6 months ago
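For point 1 above, one possible way to pull the parsed datapoints as a series rather than as summary statistics is the SageMaker Python SDK's `TrainingJobAnalytics` class. The sketch below assumes SDK v2, pandas, and working AWS credentials. The helper names `fetch_metric_rows` and `rows_to_series` are hypothetical. As far as I know, this class reads from CloudWatch, so the values may be aggregated over a period rather than reported once per iteration.

```python
from collections import defaultdict


def fetch_metric_rows(job_name, metric_names):
    """Fetch metric datapoints for a training job as a list of dict rows.

    Assumes the SageMaker Python SDK (v2), pandas, and AWS credentials are
    available; the import is local so rows_to_series works without them.
    """
    from sagemaker.analytics import TrainingJobAnalytics

    frame = TrainingJobAnalytics(
        training_job_name=job_name, metric_names=metric_names
    ).dataframe()
    # Expected columns: timestamp, metric_name, value.
    return frame.to_dict("records")


def rows_to_series(rows):
    """Group rows into {metric_name: [(timestamp, value), ...]}, sorted by time."""
    series = defaultdict(list)
    for row in rows:
        series[row["metric_name"]].append((row["timestamp"], row["value"]))
    return {name: sorted(points) for name, points in series.items()}
```

A call such as `rows_to_series(fetch_metric_rows("my-training-job", ["training:loss"]))`, where the job name is a placeholder, would return points that can be passed to any plotting library.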
