SageMaker training run for multiclass classification does not store the trained model

Hi,

I trained a multiclass classification model using auto-ml.

I used:

  • Training image: sagemaker-xgboost:1.3-1-cpu-py
  • Instance type: ml.m5.12xlarge

The run completed 2 cross-validation folds before the time limit was reached. The resulting best model was not stored in the specified S3 location, even though the job is configured to store the model on termination.

In parallel, I have successfully trained other classification models with the same auto-ml template (Jupyter notebook), so I don't think it is a configuration or permission issue.

The main difference for this classification training is the higher number of labels, which is 1950. The allowed label limit for this algorithm is 2000.

I also repeated the run for this model candidate twice, with the same result: the model was not stored.

CloudWatch has no log entries indicating a problem creating or storing the model.

Thanks, Arthur

1 Answer
Accepted answer

I solved the problem on my own:

  • I reduced the number of folds so that the algorithm needs less time to finish (set hyperparameter _kfold: 2).

  • Another possibility would be to increase the time the algorithm is allowed to run, so that it can finish.

After giving the algorithm enough time to finish, it completed and also stored the model in s3.
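
For illustration, the two options above map onto two fields of a SageMaker training job request: the `HyperParameters` map and `StoppingCondition.MaxRuntimeInSeconds`. The sketch below only builds those two fragments as a plain dictionary. The helper name `build_training_overrides` and the example values are hypothetical, and `_kfold` is simply the hyperparameter name mentioned in this thread. The remaining request fields (image, role, data channels, output path) would come from the existing notebook.

```python
def build_training_overrides(kfold, max_runtime_seconds):
    """Build the request fragments that control fold count and time budget.

    Hypothetical helper: it does not call AWS, it only returns a dict that
    could be merged into the arguments of a training job request.
    """
    if kfold < 2:
        raise ValueError("cross validation needs at least 2 folds")
    if max_runtime_seconds <= 0:
        raise ValueError("max_runtime_seconds must be positive")
    return {
        # SageMaker passes hyperparameter values as strings.
        "HyperParameters": {"_kfold": str(kfold)},
        # Upper bound on how long the training job may run.
        "StoppingCondition": {"MaxRuntimeInSeconds": int(max_runtime_seconds)},
    }


# Example values only (assumed, not taken from the original job).
overrides = build_training_overrides(kfold=2, max_runtime_seconds=4 * 3600)
print(overrides)
```

Either knob alone may be enough; the point is that the job should finish on its own instead of being stopped at the time limit.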

So the problem was storing the model on termination: I suppose the default time of 120 seconds allowed for this was not enough.
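
One way to check this guess is to look at the job description rather than the logs. A minimal sketch, assuming the response shape of `describe_training_job` in boto3 (`TrainingJobStatus`, `SecondaryStatus`, `ModelArtifacts.S3ModelArtifacts`). The helper `summarize_job` and the sample dictionary are made up for illustration and are not output from the job in this thread.

```python
def summarize_job(description):
    """Report whether a job hit its time limit and which model location it reports."""
    secondary = description.get("SecondaryStatus")
    artifacts = description.get("ModelArtifacts", {})
    return {
        "status": description.get("TrainingJobStatus"),
        "hit_time_limit": secondary == "MaxRuntimeExceeded",
        # Reported S3 location, if any; this alone does not prove the file exists.
        "model_s3_uri": artifacts.get("S3ModelArtifacts"),
    }


# Made-up sample. With boto3 the dict would come from
# boto3.client("sagemaker").describe_training_job(TrainingJobName=...).
sample = {"TrainingJobStatus": "Stopped", "SecondaryStatus": "MaxRuntimeExceeded"}
print(summarize_job(sample))
```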

arthur
answered 2 years ago
