How to create a custom model tar file via the Create Model step in a SageMaker pipeline?


Based on the sample/docs provided here => https://www.philschmid.de/mlops-sagemaker-huggingface-transformers, I am fine-tuning a Hugging Face DistilBERT model in SageMaker Studio via a pipeline, and this example works. When the model is created, I specify entry_point = 'predict.py' and source_dir = 'script' (see below), which produces the following directory structure in the model.tar.gz file. There are other files such as tokenizer.json, tokenizer_config.json and special_tokens_map.json at the root of the generated tar file; I assume these are downloaded from Hugging Face along with the PyTorch model file. Is it possible to put these files into another folder next to my script folder during the model create/package step?

Model directory structure:

model.tar.gz/
|- pytorch_model.bin
|- tokenizer.json
|- tokenizer_config.json
|- special_tokens_map.json
|- ...
|- script/
   |- predict.py
   |- requirements.txt 
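
For illustration, the layout I would like to end up with is something like the tree below (the folder name tokenizer/ is only a hypothetical example, not something the SDK produces today):

model.tar.gz/
|- pytorch_model.bin
|- tokenizer/
|  |- tokenizer.json
|  |- tokenizer_config.json
|  |- special_tokens_map.json
|- script/
   |- predict.py
   |- requirements.txt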
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput( ...  ),
        "test": TrainingInput( ...     ),
    },
....
)

# Create Model
model = Model(
    entry_point = 'predict.py', 
    source_dir = 'script',
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
    ....
)

step_create_model = ModelStep(
    name="CreateModel",
    step_args=model.create("ml.m4.large"),
)
Asked 1 year ago · 639 views

1 Answer

You are correct, these JSON files come with distilbert-base-uncased: https://huggingface.co/distilbert-base-uncased/tree/main

I'm not familiar with the model training process, but it might be that these files are used when the model is initially trained and are then not required for a deployed endpoint. I can see them referenced in the tokenizer code at https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/models/bert/tokenization_bert_fast.py, so it might be possible to load them from another location initially.
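
For what it's worth, here is a minimal sketch (not from the original thread; the tokenizer/ folder name and the repack_model_artifact helper are my own assumptions) of how the training artifact could be repacked between the training step and the Create Model step so that the tokenizer files end up in their own folder next to script/:

# Hypothetical sketch: download the training artifact, move the tokenizer
# files into a tokenizer/ sub-folder, and re-upload the repacked tar before
# the Create Model step. Folder and helper names are assumptions.
import os
import shutil
import tarfile
import tempfile

import boto3

TOKENIZER_FILES = {
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
}

def repack_model_artifact(s3_uri_in, s3_uri_out):
    """Repack model.tar.gz so the tokenizer files live under tokenizer/."""
    s3 = boto3.client("s3")
    in_bucket, in_key = s3_uri_in.replace("s3://", "").split("/", 1)
    out_bucket, out_key = s3_uri_out.replace("s3://", "").split("/", 1)

    with tempfile.TemporaryDirectory() as tmp:
        local_tar = os.path.join(tmp, "model.tar.gz")
        s3.download_file(in_bucket, in_key, local_tar)

        extract_dir = os.path.join(tmp, "model")
        with tarfile.open(local_tar, "r:gz") as tar:
            tar.extractall(extract_dir)

        # Move the tokenizer files into their own folder next to script/.
        tokenizer_dir = os.path.join(extract_dir, "tokenizer")
        os.makedirs(tokenizer_dir, exist_ok=True)
        for name in TOKENIZER_FILES:
            src = os.path.join(extract_dir, name)
            if os.path.exists(src):
                shutil.move(src, os.path.join(tokenizer_dir, name))

        # Re-create the tarball with the new layout and upload it.
        repacked = os.path.join(tmp, "model-repacked.tar.gz")
        with tarfile.open(repacked, "w:gz") as tar:
            tar.add(extract_dir, arcname=".")

        s3.upload_file(repacked, out_bucket, out_key)

In a pipeline this would have to run as its own step (for example a ProcessingStep or LambdaStep) between the TrainingStep and the ModelStep, and predict.py would then need to load the tokenizer explicitly from that sub-folder (for example AutoTokenizer.from_pretrained(os.path.join(model_dir, "tokenizer")) inside model_fn), since as far as I know the Hugging Face inference container looks for the tokenizer files at the root of the extracted model directory by default.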

robbrad (AWS)
Answered 1 year ago
