How can I create a custom model tar file via the create-model step in a SageMaker pipeline?


Based on the sample and docs at https://www.philschmid.de/mlops-sagemaker-huggingface-transformers, I am fine-tuning a Hugging Face DistilBERT model in SageMaker Studio via a pipeline. This example works. When the model is created, I specify entry_point='predict.py' and source_dir='script' (see below), which produces the following directory structure in the model.tar.gz file. There are other files, such as tokenizer.json and tokenizer_config.json. Is it possible to put these files into another folder next to my script folder during the model create/package step? I assume these files are downloaded from Hugging Face along with the PyTorch model file and are placed at the root of the generated model tar file.

model directory structure

model.tar.gz/
|- pytorch_model.bin
|- tokenizer.json
|- tokenizer_config.json
|- special_tokens_map.json
|- ...
|- script/
   |- predict.py
   |- requirements.txt 
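
# Imports assumed from the SageMaker Python SDK
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TrainingStep

# Estimator that fine-tunes the model (role and hyperparameters are defined elsewhere)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)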

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput( ...  ),
        "test": TrainingInput( ...     ),
    },
....
)

# Create Model
model = Model(
    entry_point='predict.py',
    source_dir='script',
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
    ....
)

step_create_model = ModelStep(
    name="CreateModel",
    step_args=model.create("ml.m4.large"),
)
asked a year ago, 618 views
1 Answer

You are correct: these JSON files come with distilbert-base-uncased (https://huggingface.co/distilbert-base-uncased/tree/main).

I'm not familiar with the model training process, but it might be that these files were used to train the model initially and are then not required later for a deployed endpoint. They are referenced in the tokenizer code at https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/models/bert/tokenization_bert_fast.py.
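Whether the endpoint needs the files depends on what predict.py does. If it does load the tokenizer, and the tokenizer files have been moved into a subfolder of the archive, the script has to look there. The following is only a sketch under assumptions: the subfolder name tokenizer/ and the helper resolve_tokenizer_dir are hypothetical, model_fn(model_dir) is assumed to be the loading hook your inference container calls, and AutoModelForSequenceClassification is a guess at the task.

```python
import os


def resolve_tokenizer_dir(model_dir, subdir="tokenizer"):
    """Return model_dir/subdir if it holds a tokenizer config, else model_dir.

    "tokenizer" is a hypothetical folder name, not something SageMaker creates.
    """
    candidate = os.path.join(model_dir, subdir)
    if os.path.isfile(os.path.join(candidate, "tokenizer_config.json")):
        return candidate
    return model_dir


def model_fn(model_dir):
    """Load model and tokenizer; assumes a sequence-classification model."""
    # Imported lazily so the path helper can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(resolve_tokenizer_dir(model_dir))
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    return model, tokenizer
```

Falling back to model_dir keeps the script working with the default layout, where the tokenizer files sit at the root of the archive.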

These files are used in that code, and it might be possible to train from another location initially.
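As for the layout asked about in the question, I don't know of a Model or ModelStep option that does this, so one possible approach is to repack the training artifact yourself before the model is created. The sketch below uses only the Python standard library. The function name repack_with_tokenizer_dir, the folder name tokenizer/, and the default file list (taken from the listing in the question) are all assumptions for illustration.

```python
import tarfile

# File names taken from the directory listing in the question; adjust as needed.
TOKENIZER_FILES = frozenset(
    {"tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"}
)


def repack_with_tokenizer_dir(src_tar, dst_tar,
                              tokenizer_files=TOKENIZER_FILES,
                              subdir="tokenizer"):
    """Copy src_tar to dst_tar, moving root-level tokenizer files into subdir/."""
    with tarfile.open(src_tar, "r:gz") as src, tarfile.open(dst_tar, "w:gz") as dst:
        for member in src.getmembers():
            # Archives are sometimes written with a leading "./" on each entry.
            name = member.name[2:] if member.name.startswith("./") else member.name
            if member.isfile() and name in tokenizer_files:
                member.name = f"{subdir}/{name}"
                # A stored pax "path" header would otherwise override the new name.
                member.pax_headers.pop("path", None)
            # Only regular files carry data; directories and links have none.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
```

A call such as repack_with_tokenizer_dir("model.tar.gz", "model_repacked.tar.gz") would leave pytorch_model.bin and script/ where they are and place the tokenizer files under tokenizer/. The repacked archive would still have to be uploaded to S3 and passed as model_data, for example from a step that runs between training and model creation.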

robbrad (AWS)
answered a year ago
