SageMaker Hyperparameter Tuning


Hi Everyone,

Context on the problem: I am training a scikit-learn Pipeline with custom pre-processing and an XGBRegressor, using the SKLearn Estimator image in a bring-your-own-training-script setup. I have a training and validation split of my data that I load in my training script; I train the pipeline model, make predictions on the training and validation sets, and log evaluation metrics such as train:mae and val:mae. I am also able to launch a hyperparameter tuning job with my custom metric definitions, and the job finishes successfully. The objective I choose to minimize is validation:mae with the "Bayesian" optimization strategy.
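Roughly, I launch the tuning job like this (the entry point, framework version, hyperparameter ranges, S3 paths, and metric regexes below are illustrative placeholders, not my exact values):

```python
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Bring-your-own-script estimator (script name, version, and instance type are placeholders).
estimator = SKLearn(
    entry_point="train.py",
    framework_version="1.2-1",
    py_version="py3",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role="<execution-role-arn>",
)

# Regexes that SageMaker applies to the training job's logs to extract metric values.
metric_definitions = [
    {"Name": "train:mae", "Regex": r"train:mae=([0-9\.]+)"},
    {"Name": "validation:mae", "Regex": r"validation:mae=([0-9\.]+)"},
]

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:mae",
    objective_type="Minimize",
    strategy="Bayesian",
    metric_definitions=metric_definitions,
    hyperparameter_ranges={
        "n_estimators": IntegerParameter(100, 1000),
        "learning_rate": ContinuousParameter(0.01, 0.3),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://<bucket>/train", "validation": "s3://<bucket>/validation"})
```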

Questions:

  1. Why does each training job associated with the tuning job have a 'minimum', 'maximum', 'average', and 'std-dev' for each metric? I do not perform any cross-validation in my custom training script, so where does SageMaker get these numbers?
  2. Why does the metrics table log the minimum value of the validation error rather than the average value? How does this not lead to overfitting, if we drive the tuning objective to a minimum while also selecting the minimum-error split from internal cross-validation?

Please help me understand what's going on in SageMaker hyperparameter tuning metric logs and objective tuning. I understand Bayesian optimization, and SageMaker seems to be performing the optimization correctly, but it appears to not be using the correct numbers to optimize.

[Screenshot 1: training job metric summary statistics. Screenshot 2: tuning job Runs table with the OptimizationMetric column.]

1 Answer

SageMaker training job metrics are time series: Your job can log multiple values of e.g. train:mae over time as it trains, which is useful for long-running training jobs to continuously report metrics for monitoring (and maybe trigger early stopping).
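To illustrate (a hedged sketch with synthetic data and a placeholder metric name, not your actual script): every printed line that matches a metric-definition regex becomes one data point in the time series, so logging the metric several times during training is enough for SageMaker to report minimum/maximum/average/std-dev, with no cross-validation anywhere.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.normal(size=200)
X_val, y_val = rng.normal(size=(50, 5)), rng.normal(size=50)

model = XGBRegressor(n_estimators=50)
model.fit(X_train, y_train)

# Evaluate at a few intermediate tree counts to mimic logging while training runs.
# Each printed line is one data point in the "validation:mae" time series.
for n_trees in (10, 20, 30, 40, 50):
    preds = model.predict(X_val, iteration_range=(0, n_trees))
    print(f"validation:mae={mean_absolute_error(y_val, preds)}")
```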

This is why metrics are generally described by summary statistics. You can usually see these time series charts in Run > Charts within Studio, or from the training job details page in the AWS Console... But if your training job is very short, the graphs might not be that interesting: I believe they aggregate data points to a 1min or 5min granularity by default.

So which is the important statistic to look at? Usually 'Final value', but it depends on what your script is doing.

For example, if you're training a model with checkpointing and automatic stopping, it could be that accuracy gets worse for a few iterations before the script detects the issue, stops training, and re-loads the best-performing model from a checkpoint. In that case, you could either make sure your script re-logs the final accuracy score so that "Final" is consistent with the final model, or just refer to "Max".
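A rough sketch of that first option (synthetic data and placeholder names; I've used MAE here since that's your metric, so you'd track the minimum rather than the maximum): keep the best score seen during training and re-emit it as the last logged value, so "Final" agrees with the model you actually save.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.normal(size=200)
X_val, y_val = rng.normal(size=(50, 5)), rng.normal(size=50)

best_mae = float("inf")
# Stand-in for checkpointed training rounds; a real script would save each checkpoint.
for n_estimators in (10, 25, 50, 100):
    model = XGBRegressor(n_estimators=n_estimators).fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f"validation:mae={mae}")
    if mae < best_mae:
        best_mae = mae

# Re-log the best score last, so the "Final" statistic matches the restored best model.
print(f"validation:mae={best_mae}")
```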

Alternatively, if you have a script that does cross-validation and you want the summary statistics to accurately reflect this (e.g. the standard deviation is the deviation of accuracy between folds, the average is the average over folds, etc.), then you would want to make sure your script logs the metric exactly once per validation fold, without any repetitions during the training process. That's nice for cross-validation statistics, but those metrics then wouldn't give you continuous insight into the model as it trained (if that's something you want).
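For example (again just a sketch with synthetic data): logging the metric exactly once per fold makes the average/std-dev describe the spread across folds.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(250, 5)), rng.normal(size=250)

# One metric line per fold, so the summary statistics summarize the folds.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = XGBRegressor(n_estimators=50).fit(X[train_idx], y[train_idx])
    fold_mae = mean_absolute_error(y[val_idx], model.predict(X[val_idx]))
    print(f"validation:mae={fold_mae}")
```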

I couldn't find either 135.2... or 186.2... in your Runs table's OptimizationMetric column, so I think your second screenshot may be omitting the record for the run shown in the first screenshot. But the Final value 135.2... is the one I'd usually expect to see listed as the summary value.

AWS
EXPERT
Alex_T
answered a year ago
