JumpStart Llama2 Finetuning - Recurring Runtime Error


Hi,

I'm using a small dataset to fine-tune a Llama 2 model. Each time I try to train via JumpStart, the job fails with the same error:

We encountered an error while training the model on your data. AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "raise RuntimeError(
RuntimeError: Invalid function argument. Expected parameter tensor to be of type torch.Tensor.
Traceback (most recent call last):
  File "/opt/ml/code/llama_finetuning.py", line 301, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ml/code/llama_finetuning.py", line 281, in main
    results = train(
  File "/opt/ml/code/llama-recipes/utils/train_utils.py", line 117, in train
    dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

Can anyone suggest what I may be doing wrong here?

TW2023
asked 8 months ago · 600 views
1 Answer

This error indicates that an invalid data type is being passed to a PyTorch distributed training function during multi-GPU training.

Specifically, it looks like total_loss is reaching dist.all_reduce() as something other than a PyTorch Tensor (for example, a plain Python float), while the collective call only accepts Tensors.
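To illustrate, here is a minimal, single-process sketch (not the JumpStart training script itself) of why the collective call rejects a plain Python number and how wrapping the value in a tensor satisfies it. The gloo backend, port, and loss value are assumptions for demonstration only:

```python
# Minimal single-process sketch (gloo backend, world_size=1) showing the
# failure mode from the traceback. Values here are made up for illustration;
# this is not the JumpStart training script.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

total_loss = 1.234  # a plain Python float, e.g. accumulated via loss.item()

try:
    # Collectives only accept tensors, so this reproduces the error.
    dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
except (RuntimeError, TypeError) as err:  # exact exception type varies by PyTorch version
    print(f"all_reduce rejected the float: {err}")

# Wrapping the value in a tensor satisfies the type check.
total_loss = torch.tensor(total_loss)
dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
print(f"reduced loss: {total_loss.item()}")

dist.destroy_process_group()
```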

Some things to check:

  • Make sure total_loss is a PyTorch Tensor before passing it to dist.all_reduce() (a defensive sketch follows this list)

  • Verify the data types and shapes of all inputs to the model forward pass match what the model expects

  • Print or log the types and shapes of inputs at various points to validate they are Tensors

  • Try training on a single GPU (or a single instance) first to determine whether it is a distributed training issue

  • Ensure that compatible, up-to-date versions of PyTorch (which includes torch.distributed) and the other training libraries are installed
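Putting the first few checks together, a hedged sketch of the defensive pattern might look like the following. The helper name, the device argument, and the stand-in variables are assumptions; the actual llama-recipes train_utils.py may structure this differently:

```python
# Hedged sketch of a defensive wrapper around the failing collective call.
# `total_loss` and `device` are stand-ins for whatever the training loop uses.
import torch
import torch.distributed as dist

def reduce_total_loss(total_loss, device):
    """Log the type, coerce to a Tensor on `device`, then all-reduce if distributed."""
    if not torch.is_tensor(total_loss):
        print(f"total_loss is {type(total_loss).__name__}, converting to a Tensor")
        total_loss = torch.tensor(float(total_loss), device=device)
    else:
        total_loss = total_loss.to(device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
    return total_loss
```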

The key is tracing back to find where a non-Tensor value is being passed and handling it appropriately. Adding more logging or printing can help identify the source, and simplifying to single-GPU or single-instance training can determine whether it is a distributed training bug.
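If you want to rule out multi-node issues from the JumpStart side, one option is to relaunch the fine-tuning job with a single training instance via the SageMaker Python SDK. The rough sketch below uses placeholder values throughout; the model ID, instance type, hyperparameters, and S3 path are assumptions, and you should confirm which instance types the Llama 2 fine-tuning configuration actually supports:

```python
# Rough sketch: relaunching the JumpStart fine-tuning job on a single instance
# to isolate distributed-training problems. All values below are placeholders.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",  # swap in the model you selected in JumpStart
    environment={"accept_eula": "true"},        # required to fine-tune Llama 2 models
    instance_count=1,                           # single instance to rule out multi-node issues
    instance_type="ml.g5.12xlarge",             # placeholder; check supported instance types
)
estimator.set_hyperparameters(epoch="1")        # placeholder hyperparameters
estimator.fit({"training": "s3://your-bucket/path/to/training-data/"})
```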

Let me know if this helps point you in the right direction or if any other details would help troubleshoot further!

AWS
Saad
answered 5 months ago
