I have a question regarding:
- Llama 2 fine-tuning, specifically on AWS SageMaker
- Masked Language Modelling (MLM)
- Instruction tuning/Supervised fine tuning
TLDR:
Is there a way to do MLM training within AWS SageMaker, specifically with the Llama 2 model?
- More specifically, is there a way to set `tokenizer.mask_token = "<mask>"`, or otherwise apply an `attention_mask`, in order to do MLM?
I have the following difficulties w.r.t. fine-tuning the model:
- Because of my infrastructure constraints, I have to use AWS SageMaker for training
- AFAIK, there are two main ways to do the training:
  - Through the GUI in SageMaker
  - Using a notebook, where you have a bit more flexibility
I want to use Masked Language Modelling (MLM) where essentially the training data looks like this:
"input": "Alex is living in <mask>. "output": "London"
- However, Llama 2 does not come with a built-in `mask_token`. Running `tokenizer.mask_token` gives: "Using mask_token, but it is not set yet."
- This is also the case according to this video and repo.
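For context, this is roughly what I mean by adding the missing mask token locally; a minimal sketch, assuming the standard `transformers` `add_special_tokens` / `resize_token_embeddings` approach carries over to Llama 2:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # gated repo; assumes access has been granted

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Accessing tokenizer.mask_token here logs "Using mask_token, but it is not set yet."
# and returns None.
print(tokenizer.mask_token)

# Register "<mask>" as the mask token and grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens({"mask_token": "<mask>"})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Save the updated tokenizer and model so they can be packaged into model.tar.gz later.
tokenizer.save_pretrained("llama-2-7b-with-mask")
model.save_pretrained("llama-2-7b-with-mask")
```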
- There is no obvious way to add this `mask_token` in the AWS GUI or in the notebook example provided in SageMaker
- The `JumpStartEstimator` object (`from sagemaker.jumpstart.estimator import JumpStartEstimator`) does not seem to have this option
What I was thinking is (see the sketch after this list):
- Download the **Llama-2-7b-hf** model
- Update its tokenizer and model to include the missing `mask_token`
- Tar this model into a `model.tar.gz`
- Upload the tarred model to an S3 bucket
- Initialize an estimator object with this model
- Fit with my data
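As a rough sketch of steps 3 and 4 (tar and upload), assuming the updated model was saved to `llama-2-7b-with-mask/` and using a placeholder bucket name:

```python
import tarfile
import boto3

# Package the locally saved, mask-token-aware model into model.tar.gz.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("llama-2-7b-with-mask", arcname=".")

# Upload the archive to S3; bucket and key are placeholders.
s3 = boto3.client("s3")
s3.upload_file("model.tar.gz", "my-bucket", "llama-2-mlm/model.tar.gz")
```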
But I am not sure how to point the `JumpStartEstimator` to the model I would upload to S3 so that it has the correct mask token; my rough idea is sketched below.
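This is roughly what I was hoping would work. The generic `sagemaker.estimator.Estimator` takes a `model_uri` argument for training from existing artifacts, but I don't know whether `JumpStartEstimator` honors a custom `model_uri` (or what the right training channel name is), so please treat the following purely as a guess:

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Pure guess: point the JumpStart estimator at my own artifacts in S3 instead of
# the stock Llama 2 weights. I don't know if model_uri is actually honored here.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",
    model_uri="s3://my-bucket/llama-2-mlm/model.tar.gz",  # placeholder path
    instance_type="ml.g5.12xlarge",
)

# The channel name is also a guess; the JumpStart examples I have seen use "training".
estimator.fit({"training": "s3://my-bucket/llama-2-mlm/train/"})
```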
Any help is greatly appreciated!