Incremental training with custom keras code in script mode

0

Hi,
I'm moving my first steps in sagemaker. I'm using script mode to train a classification algorithm. Training is fine, however I'm not able to do incremental training. I want to train again the same model with new data. Here what I did. This is my script

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

bucket = 'sagemaker-blablabla'
train_data = 's3://{}/{}'.format(bucket,'train')
validation_data = 's3://{}/{}'.format(bucket,'test')

s3_output_location = 's3://{}'.format(bucket)

tf_estimator = TensorFlow(entry_point='main.py', 
                          role=get_execution_role(),
                          train_instance_count=1, 
                          train_instance_type='ml.p2.xlarge',
                          framework_version='1.12', 
                          py_version='py3',
                          output_path=s3_output_location)

inputs = {'train': train_data, 'test': validation_data}
tf_estimator.fit(inputs)

The entry point is my custom keras code, which I adapted to receive arguments from the script.
Now the training is successfully completed and I have in my s3 bucket the model.tar.gz. I want to train again, but it's not clear to me how to do it.. I tried this

trained_model = 's3://sagemaker-blablabla/sagemaker-tensorflow-scriptmode-2019-11-27-12-01-42-300/output/model.tar.gz'

tf_estimator = sagemaker.estimator.Estimator(image_name='blablabla-west-1.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12-gpu-py3', 
                                              role=get_execution_role(),
                                              train_instance_count=1, 
                                              train_instance_type='ml.p2.xlarge',
                                              output_path=s3_output_location,
                                              model_uri = trained_model)

inputs = {'train': train_data, 'test': validation_data}

tf_estimator.fit(inputs)

Doesn't work.. first I don't know how to retrieve the training image name (for this I looked for it in the aws console, but I guess there should be a smarter solution), second this code throws an exception about the entry point.. but it is my understanding that I shouldn't need it when I do incremental learning with a ready image..
I'm surely missing something important, any help? Thank you

rokk07
已提問 4 年前檢視次數 408 次
1 個回答
0

Found by myself, here the answer: https://stackoverflow.com/a/59096925/4267439

rokk07
已回答 4 年前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南