Hi plamd, I have not figured out a solution yet, but I am going to try training the Hugging Face Llama-2 model in SageMaker directly. I suspect this is an issue with Llama-2 on JumpStart. See this: https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html Good luck!
@reza - in my case, this seems to happen when a validation dataset is explicitly specified. When I omit the validation dataset and supply only a training dataset, the training runs pass (a portion of the training set is then used for validation, controlled via the validation_split_ratio hyperparameter). This is quite limiting (and the error message is really misleading), but it's the only way I've been able to get this working.
@plamd thanks for the info, really useful. I need to be able to select my validation set manually, though. I have submitted a formal support case and will let you know if I learn something new.
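The workaround plamd describes above can be sketched with the SageMaker Python SDK. This is a minimal sketch, not a confirmed fix: the model_id, the exact hyperparameter names, and the S3 paths are assumptions based on the JumpStart Llama-2 fine-tuning defaults, so verify them against your JumpStart version before running.

```python
# Sketch of the workaround above: pass only a training channel and let
# JumpStart hold out part of the training set via validation_split_ratio.
# Hyperparameter names and model_id are assumptions -- check your JumpStart version.

hyperparameters = {
    "epoch": "1",
    # No separate validation channel: 20% of the training set is held out instead.
    "validation_split_ratio": "0.2",
}

def build_fit_inputs(train_s3_uri, validation_s3_uri=None):
    """Build the channel dict for estimator.fit().

    Omitting 'validation' avoids the failing code path described in this
    thread; pass validation_s3_uri only once the underlying bug is fixed.
    """
    inputs = {"training": train_s3_uri}
    if validation_s3_uri is not None:
        inputs["validation"] = validation_s3_uri
    return inputs

# Usage (requires AWS credentials; bucket/paths below are hypothetical):
# from sagemaker.jumpstart.estimator import JumpStartEstimator
# estimator = JumpStartEstimator(
#     model_id="meta-textgeneration-llama-2-7b",
#     hyperparameters=hyperparameters,
# )
# estimator.fit(build_fit_inputs("s3://my-bucket/llama2-train/"))
```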
I am also getting this when I try to fine-tune a Llama-2 chat model via the SageMaker JumpStart Studio UI (tried with the 7B and 70B chat variants). Here is the stack trace I get:
For the 70B model, the training fails after ~38 minutes, and it seems we do get billed for that time.
Any idea whether this is misleading error reporting or a bug on the SageMaker side?