JumpStartEstimator.fit ERROR: UnboundLocalError: local variable 'dataset_train' referenced before assignment


Hello, I have a simple model-tuning Python script. When I run it I get the following error; I'd appreciate your help. ERROR: sagemaker.exceptions.UnexpectedStatusException: Error for Training job meta-textgeneration-llama-2-7b-f-2023-11-12-03-11-41-315: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "UnboundLocalError: local variable 'dataset_train' referenced before assignment

Here is the code snippet:

    estimator = JumpStartEstimator(
        model_id=jumpstart_model_name,
        model_version=jumpstart_model_version,
        instance_count=Instance_Count,
        instance_type=InstanceType,
        environment={"accept_eula": "true"},
        role=my_role,
        region=region_name,
        sagemaker_session=Sagemaker_Session,
        max_run=MaxTraining_Time_Sec,
        input_mode="File",
        output_path=s3_train_model_path,
        hyperparameters=my_hyperparameters,
        debugger_hook_config=hook_config,
        tensorboard_output_config=tensorboard_output_config,
    )

#estimator.set_hyperparameters(my_hyperparameters)

    channels = {
        "training": s3_train_data_path,
        "test": s3_test_data_path,
        "validation": s3_val_data_path,
    }

Model tuning:

estimator.fit(inputs=channels, logs="All", wait=True)
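For context, an UnboundLocalError like the one in the job log generally means the training script assigns a variable in only one branch of a conditional and then uses it unconditionally. We can't see the actual llama_finetuning.py code, but a minimal standalone sketch of that failure mode (function and parameter names are illustrative) looks like this:

```python
def preprocess(train_path, validation_path=None):
    """Illustrative stand-in for the container's dataset preprocessing."""
    if validation_path is None:
        # Only this branch binds the two names.
        dataset_train = f"train data from {train_path}"
        dataset_val = "split carved from training data"
    # If an explicit validation set takes the other (missing) branch,
    # the names below were never assigned in this scope:
    return dataset_train, dataset_val

# Omitting the validation set works:
train, val = preprocess("s3://bucket/train")

# Supplying one reproduces the same class of error as the training job:
try:
    preprocess("s3://bucket/train", validation_path="s3://bucket/val")
except UnboundLocalError as e:
    print(f"UnboundLocalError: {e}")
```

This would be consistent with the workaround reported in the answers below, where the job only succeeds when no explicit validation channel is passed.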

  • Also getting this when I try to fine-tune a Llama 2 chat model via the SageMaker JumpStart Studio UI (tried with the 7b and 70b chat variants). Here is the stack trace I get:

    We encountered an error while training the model on your data. AlgorithmError: ExecuteUserScriptError:
    ExitCode 1
    ErrorMessage "UnboundLocalError: local variable 'dataset_train' referenced before assignment
     Traceback (most recent call last)
     File "/opt/ml/code/llama_finetuning.py", line 336, in <module>
     fire.Fire(main)
     File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
     component_trace = _Fire(component, args, parsed_flag_args, context, name)
     File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
     component, remaining_args = _CallAndUpdateTrace(
     File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
     component = fn(*varargs, **kwargs)
     File "/opt/ml/code/llama_finetuning.py", line 236, in main
     dataset_train, dataset_val = preprocess_instruction_tuned_and_chat_dataset(
    

    For the 70b model, the training fails after ~38 minutes and it seems we do get billed for that time.

    Any ideas whether this is incorrect error reporting or a bug on the SageMaker side?

reza
asked 6 months ago, 280 views
3 Answers

Hi plamd, I have not figured out a solution yet, but I am going to try Hugging Face Llama 2 model training in SageMaker. I suspect this is an issue with Llama 2 on JumpStart. See: https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html Good luck!

reza
answered 6 months ago

@reza - in my case, this seems to happen when a validation set is explicitly specified. When I omit the validation channel and use only a training set, the training runs pass (a portion of the training set is then used for validation, controlled via the validation_split_ratio hyperparameter). This is quite limiting (and the error is really misleading), but it's the only way I've been able to get this working.
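A minimal sketch of that workaround, assuming the same channel names as in the question (the bucket paths are placeholders, and 0.1 is just an example ratio):

```python
# Workaround sketch: drop the explicit "validation" channel and let the
# training container carve a validation split from the training data.
s3_train_data_path = "s3://my-bucket/llama2/train/"  # placeholder path

channels = {
    "training": s3_train_data_path,
    # "validation": intentionally omitted -- specifying it triggered the error
}

my_hyperparameters = {
    "validation_split_ratio": "0.1",  # 10% of training data held out for validation
}

# Then fit with only the training channel, e.g.:
# estimator.fit(inputs=channels, logs="All", wait=True)
print(sorted(channels))
```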

plamd
answered 6 months ago

@plamd thanks for the info, really useful. I need to be able to select my validation/test set manually, though. I have submitted a formal support case and will let you know if I learn anything new.

reza
answered 6 months ago
