How can I create a custom model.tar.gz via the create-model step in a SageMaker pipeline?

Based on the sample and docs provided at https://www.philschmid.de/mlops-sagemaker-huggingface-transformers, I am fine-tuning a Hugging Face DistilBERT model in SageMaker Studio via a pipeline. The example works. When the model is created, I specify entry_point = 'predict.py' and source_dir = 'script' (see below), which produces the directory structure shown below in the model.tar.gz file. The archive also contains other files such as tokenizer.json and tokenizer_config.json. I assume these are downloaded from Hugging Face along with the PyTorch model file and placed at the root of the generated model archive. Is it possible to put these files into another folder next to my script folder during the model create/package step?

Model directory structure:

model.tar.gz/
|- pytorch_model.bin
|- tokenizer.json
|- tokenizer_config.json
|- special_tokens_map.json
|- ...
|- script/
   |- predict.py
   |- requirements.txt 
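
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TrainingStep

# Estimator that runs the fine-tuning script.
# role and hyperparameters are defined earlier in the notebook.
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)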

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput( ...  ),
        "test": TrainingInput( ...     ),
    },
....
)

# Create Model
model = Model(
    entry_point='predict.py',
    source_dir='script',
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
    ....
)

step_create_model = ModelStep(
    name="CreateModel",
    step_args=model.create("ml.m4.large"),
)

asked a year ago · 639 views

1 Answer

You are correct: these JSON files come with distilbert-base-uncased (https://huggingface.co/distilbert-base-uncased/tree/main).

I'm not familiar with the model training process, but it might be that these files were used to train the model initially and are then not required later for a deployed endpoint. I can see them referenced in the tokenizer source at https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/models/bert/tokenization_bert_fast.py.

Since the files are used there, it might be possible to load them from another location during the initial training.
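
One possible approach, which is not from the linked example, relies on the fact that a SageMaker training job packs whatever is in the model directory (SM_MODEL_DIR, by default /opt/ml/model) into model.tar.gz. A minimal sketch, assuming train.py has already saved the model and tokenizer to the root of that directory: the helper below moves the tokenizer files into a subfolder before the job ends. The function name move_tokenizer_files, the subfolder name "tokenizer", and the file list are all hypothetical choices for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical list of tokenizer artifacts; adjust it to the files
# your tokenizer actually writes.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
)


def move_tokenizer_files(model_dir, subfolder="tokenizer", names=TOKENIZER_FILES):
    """Move tokenizer files from the root of model_dir into model_dir/subfolder.

    Files that are not present are skipped. Returns the names that were moved.
    """
    root = Path(model_dir)
    target = root / subfolder
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in names:
        src = root / name
        if src.is_file():
            shutil.move(str(src), str(target / name))
            moved.append(name)
    return moved
```

At the end of train.py this could be called as move_tokenizer_files(os.environ["SM_MODEL_DIR"]). The inference script would then need to load the tokenizer from that subfolder, for example by pointing AutoTokenizer.from_pretrained at the "tokenizer" directory inside the model directory, because the default loading logic expects these files at the root.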

robbrad (AWS)
answered a year ago