Based on the sample and docs at https://www.philschmid.de/mlops-sagemaker-huggingface-transformers, I am fine-tuning a Hugging Face DistilBERT model in SageMaker Studio via a pipeline. This example works. When the model is created, I specify entry_point = 'predict.py' and source_dir = 'script' (see below), which creates the following directory structure in the model.tar.gz file. There are other files such as tokenizer.json and tokenizer_config.json. Is it possible to put these files into another folder next to my script folder during the model create/package step? I assume these files are downloaded from Hugging Face along with the PyTorch model file and are placed at the root of the generated model tar file.
Model directory structure:

model.tar.gz/
|- pytorch_model.bin
|- tokenizer.json
|- tokenizer_config.json
|- special_tokens_map.json
|- ...
|- script/
   |- predict.py
   |- requirements.txt
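
To double-check this layout, the archive members can be listed with Python's standard tarfile module. The following is only a minimal sketch: it builds a small stand-in archive with the file names from the tree above, and the helper names list_members and build_sample_archive are made up for illustration. Pointing list_members at a downloaded copy of the real model.tar.gz should show its actual contents.

```python
import io
import tarfile
import tempfile
from pathlib import Path

# File names copied from the directory tree in the question.
SAMPLE_NAMES = [
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "script/predict.py",
    "script/requirements.txt",
]


def build_sample_archive(archive_path, names):
    """Write a stand-in .tar.gz containing empty files with the given names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))


def list_members(archive_path):
    """Return the sorted member names of a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "model.tar.gz"
        build_sample_archive(sample, SAMPLE_NAMES)
        for name in list_members(sample):
            print(name)
```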
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TrainingStep

# Training job definition
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput( ... ),
        "test": TrainingInput( ... ),
    },
    ....
)

# Create Model
model = Model(
    entry_point='predict.py',
    source_dir='script',
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
    ....
)

step_create_model = ModelStep(
    name="CreateModel",
    step_args=model.create("ml.m4.large"),
)
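
To make the layout being asked about concrete, here is a rough sketch of repacking an archive so that the root-level tokenizer files end up in a sibling folder. It uses only the standard tarfile module and is not something the SageMaker SDK does for you. The folder name tokenizer, the set of file names, and the helpers relocate and repack are all hypothetical choices for illustration. How such a repacking step would be wired into the pipeline is not shown.

```python
import tarfile

# Hypothetical choices for this sketch; they are not part of the pipeline above.
TOKENIZER_DIR = "tokenizer"
TOKENIZER_FILES = {
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
}


def relocate(name):
    """Map a root-level tokenizer file into TOKENIZER_DIR; leave other names alone."""
    clean = name[2:] if name.startswith("./") else name
    if clean in TOKENIZER_FILES:
        return f"{TOKENIZER_DIR}/{clean}"
    return name


def repack(src_path, dst_path):
    """Copy the src archive to dst, moving root-level tokenizer files into a subfolder."""
    with tarfile.open(src_path, "r:gz") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            # Only regular files carry a data stream; directories and links do not.
            data = src.extractfile(member) if member.isfile() else None
            member.name = relocate(member.name)
            # Drop any stored pax "path" header so the new name is the one written.
            member.pax_headers.pop("path", None)
            dst.addfile(member, data)
```

With a layout like this, predict.py would presumably have to load the tokenizer from the subfolder itself, for example with AutoTokenizer.from_pretrained(os.path.join(model_dir, "tokenizer")). Whether the serving container finds the files without such a change depends on how the inference code loads them.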