DeepSpeed error when fine-tuning Mistral 7B on JumpStart

I'm having trouble fine-tuning Mistral 7B on JumpStart. This is the error:

 ErrorMessage "raise ValueError(
 ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
 ERROR:root:Subprocess script failed with return code: 1
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling
 subprocess.run(command, shell=shell, check=True)
 File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
 raise CalledProcessError(retcode, process.args,
 subprocess.CalledProcessError:
 Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred
 File "/opt/ml/code/transfer_learning.py", line 68, in <module>
 run_with_args(args)
 File "/opt/ml/code/transfer_learning.py", line 42, in run_with_args
 subprocess.run_with_error_handling(command)
 File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 12, in run_with_error_handling
 raise RuntimeError(e)
 RuntimeError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1."
ErrorMessage "raise ValueError( ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`. ERROR:root:Subprocess script failed with return code: 1 Traceback (most recent call last) File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling subprocess.run(command, shell=shell, check=True) File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess. CalledProcessError Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1. 
During handling of the above exception, another exception occurred File "/opt/ml/code/transfer_learning.py", line 68, in <module> run_with_args(args) File "/opt/ml/code/transfer_learning.py", line 42, in run_with_args subprocess.run_with_error_handling(command) File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 12, in run_with_error_handling raise RuntimeError(e) RuntimeError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1."

My understanding is that the problem comes from the `low_cpu_mem_usage=True` parameter or from passing a `device_map`, since these options are not compatible with DeepSpeed ZeRO-3.
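
If I were writing the training script myself, this is roughly what that check guards against; the sketch below is not the JumpStart `run_clm.py`, and the model name and config path are just placeholders:

    # Hedged sketch only, not the JumpStart run_clm.py; it just illustrates why
    # transformers raises this ValueError.
    import torch
    from transformers import AutoModelForCausalLM, TrainingArguments

    # Building TrainingArguments with a ZeRO stage-3 DeepSpeed config turns on
    # ZeRO-3 detection for later from_pretrained() calls in the same process.
    training_args = TrainingArguments(
        output_dir="/opt/ml/model",
        deepspeed="ds_config.json",  # assumed to set "zero_optimization": {"stage": 3}
    )

    # The incompatible combination the error complains about:
    # model = AutoModelForCausalLM.from_pretrained(
    #     "mistralai/Mistral-7B-v0.1", device_map="auto", low_cpu_mem_usage=True
    # )

    # With ZeRO-3 enabled, the model has to be loaded without device_map /
    # low_cpu_mem_usage so DeepSpeed can partition it across the GPUs itself.
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
    )

In the JumpStart case, that loading code lives inside the container's `run_clm.py`, which the UI doesn't let me touch.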

I'm fine-tuning through the SageMaker Studio UI on the newest SageMaker Studio version. The model is the default model in the default artifact location.

As for the data, I've tried various datasets but hit the same error every time: the default dataset in the default training dataset location as well as my own custom datasets, in English only, Korean only, and mixed.

These are my hyperparameters from the most recent try:

adam_beta1	0.9
adam_beta2	0.999
adam_epsilon	1e-8
auto_find_batch_size	False
bf16	True
bits	16
dataloader_drop_last	False
dataloader_num_workers	0
double_quant	True
early_stopping_patience	3
early_stopping_threshold	0
epoch	1
eval_accumulation_steps	None
eval_steps	20
evaluation_strategy	steps
fp16	False
gradient_accumulation_steps	8
gradient_checkpointing	True
instruction_tuned	True
label_smoothing_factor	0
learning_rate	0.000006
load_best_model_at_end	True
logging_first_step	False
logging_nan_inf_filter	True
logging_steps	8
lora_alpha	16
lora_dropout	0.05
lora_r	8
lr_scheduler_type	constant_with_warmup
max_grad_norm	1
max_input_length	-1
max_steps	-1
max_train_samples	-1
max_val_samples	-1
peft_type	lora
per_device_eval_batch_size	8
per_device_train_batch_size	2
preprocessing_num_workers	None
quant_type	nf4
sagemaker_container_log_level	20
sagemaker_job_name	"jumpstart-dft-huggingface-llm-mistr-20231220-090005"
sagemaker_program	"transfer_learning.py"
sagemaker_region	"us-west-2"
sagemaker_submit_directory	"/opt/ml/input/data/code/sourcedir.tar.gz"
save_steps	500
save_strategy	steps
save_total_limit	1
seed	10
train_data_split_seed	0
train_from_scratch	False
validation_split_ratio	0.2
warmup_ratio	0.1
warmup_steps	0
weight_decay	0.2

In all of my trials, I've only changed the following hyperparameters:

- Peft Type: None -> lora
- Lora R dimension: 64 -> 8
- Lora Dropout: 0 -> 0.05

I've also tried using the default values for Lora R and Lora Dropout.

Everything else is left at its default, including the training instance (default = ml.g5.24xlarge). The thing is, I get the same error even when everything is default except for peft_type: lora.
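
In case it helps, this is roughly how I'd express the same job with the SageMaker Python SDK instead of the Studio UI (the model ID and S3 path are placeholders, not my actual values):

    # Rough SDK equivalent of the Studio UI job, sketch only.
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(
        model_id="huggingface-llm-mistral-7b",  # assumed JumpStart model ID for Mistral 7B
        instance_type="ml.g5.24xlarge",         # same default instance as the UI run
    )
    # Only the hyperparameters I changed from their defaults:
    estimator.set_hyperparameters(
        peft_type="lora",
        lora_r="8",
        lora_dropout="0.05",
    )
    estimator.fit({"training": "s3://my-bucket/mistral-train/"})  # placeholder S3 path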

I can't figure out how to solve this, since the DeepSpeed parameter handling isn't in my hands; I'm just using the UI. Fine-tuning other LLMs such as Llama 2 works fine. Please give me any clues.
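
For reference, the `ds_config.json` in the command line above is generated inside the training container, so I can only guess at its contents; a ZeRO stage-3 config of the kind that triggers this check typically looks something like this (illustrative values only):

    # Illustrative DeepSpeed config only -- the real ds_config.json is written
    # by the JumpStart container and isn't visible from the Studio UI.
    ds_config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,  # the stage-3 setting the ValueError refers to
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_clipping": "auto",
    }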

  • Hi, I suggest using block quotes to get a clean display of your messages and config. Posting them as regular text makes them quite unreadable on our side.

  • @Didier_Durand thanks for mentioning it. I've fixed the post.

  • @posted Just FYSA, adding a cross-reference to a similar issue we are experiencing, which has been logged in the AWS SageMaker Feedback repo on GitHub:

    https://github.com/aws/amazon-sagemaker-feedback/issues/24

  • @R J Lewis thanks a lot!!
