Bring my own custom LLM and tokenizer to AWS Sagemaker for fine-tuning?


I have a question regarding:

  • Llama 2 fine-tuning, specifically on AWS Sagemaker
  • Masked Language Modelling (MLM)
  • Instruction tuning/Supervised fine tuning

TLDR: Is there a way I can do MLM training within AWS Sagemaker, specifically with the Llama 2 model?

  • More specifically, is there a way to set tokenizer.mask_token = "<mask>" or apply an attention_mask otherwise in order to do MLM?

I have the following difficulties w.r.t. fine-tuning the model:

  • Because of my infrastructure constraints, I have to use AWS Sagemaker for training
  • There are two main ways to do the training AFAIK
    • Through the GUI in Sagemaker
    • Using a notebook where you have a bit more flexibility

I want to use Masked Language Modelling (MLM), where the training data essentially looks like this: "input": "Alex is living in <mask>.", "output": "London"
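To make the data format above concrete, here is a minimal sketch of how such records could be built and written as JSON Lines. The helper names `make_record` and `write_jsonl` are hypothetical, and the "input"/"output" keys simply mirror my example; I have not confirmed that any SageMaker training job expects exactly this layout.

```python
import json
from pathlib import Path


def make_record(sentence: str, answer: str, mask_token: str = "<mask>") -> dict:
    """Replace the first occurrence of `answer` in `sentence` with the mask token."""
    if answer not in sentence:
        raise ValueError(f"{answer!r} does not occur in {sentence!r}")
    return {"input": sentence.replace(answer, mask_token, 1), "output": answer}


def write_jsonl(records, path) -> Path:
    """Write one JSON object per line (hypothetical layout, see the note above)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

For example, `make_record("Alex is living in London.", "London")` produces the record shown above.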

  • However, Llama 2 does not come with a built-in mask_token

Running tokenizer.mask_token gives: "Using mask_token, but it is not set yet."
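Outside of Sagemaker, the usual Hugging Face pattern for a missing special token is to register it on the tokenizer and then resize the model's embedding matrix. The sketch below assumes the `transformers` library and a local copy of the model; `add_mask_token` and `load_and_patch` are hypothetical helper names, and whether this is enough for real MLM training is exactly what I am unsure about, since Llama 2 is a causal (left-to-right) model.

```python
def add_mask_token(tokenizer, model, mask_token: str = "<mask>") -> int:
    """Register `mask_token` on a Hugging Face tokenizer and grow the embeddings.

    Returns the number of tokens newly added to the vocabulary
    (0 if the token was already present).
    """
    num_added = tokenizer.add_special_tokens({"mask_token": mask_token})
    if num_added > 0:
        # The new token needs its own embedding row, so resize to the new vocab size.
        model.resize_token_embeddings(len(tokenizer))
    return num_added


def load_and_patch(model_dir: str, output_dir: str) -> None:
    """Illustrative usage: load from a local directory, patch, and save again."""
    # Imported lazily so add_mask_token can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    add_mask_token(tokenizer, model)
    tokenizer.save_pretrained(output_dir)
    model.save_pretrained(output_dir)
```

After this, `tokenizer.mask_token` should return "<mask>" instead of the warning, but the embedding for the new token is freshly initialized and only becomes meaningful through training.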

This is also the case according to this video and repo

There is no obvious way to add this mask_token in the AWS GUI or in the notebook example provided in Sagemaker.

The JumpStartEstimator object (from sagemaker.jumpstart.estimator import JumpStartEstimator) does not seem to have this option.

What I was thinking is:

  • Download the Llama-7b-hf model
  • Update its tokenizer and model to include the missing mask_token
  • Tar this model to a model.tar.gz
  • Upload the tarred model to an S3 bucket
  • Initialize this model to an estimator object
  • Fit with my data
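The packaging and upload steps above could be sketched roughly as follows. This assumes `boto3` with working AWS credentials for the upload; the function names, bucket, and key are hypothetical placeholders, and placing the files at the root of the archive is my assumption about what Sagemaker expects.

```python
import tarfile
from pathlib import Path


def make_model_tarball(model_dir, tar_path="model.tar.gz") -> Path:
    """Pack the contents of `model_dir` at the root of a gzip-compressed tar archive."""
    model_dir = Path(model_dir)
    tar_path = Path(tar_path)
    with tarfile.open(tar_path, "w:gz") as tar:
        for item in sorted(model_dir.iterdir()):
            # Skip the archive itself in case it is written into model_dir.
            if item.resolve() == tar_path.resolve():
                continue
            # arcname drops the parent directory so files sit at the archive root.
            tar.add(item, arcname=item.name)
    return tar_path


def upload_to_s3(tar_path, bucket: str, key: str) -> str:
    """Upload the archive and return its S3 URI (bucket and key are placeholders)."""
    import boto3  # imported lazily; requires AWS credentials to be configured

    boto3.client("s3").upload_file(str(tar_path), bucket, key)
    return f"s3://{bucket}/{key}"
```

This covers only the tar and upload steps; it does not show how to hand the resulting S3 URI to an estimator.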

But I am not sure how I can point the JumpStartEstimator to the model that I would upload to S3, so that it has the correct mask token.

Any help is greatly appreciated!

There is a blog post that describes everything you want to do, including fine-tuning, in full detail. Please see



answered 5 months ago

