This error indicates that an invalid data type is being passed to a PyTorch distributed training function during multi-GPU training. Specifically, it looks like you are passing a non-Tensor object to dist.all_reduce(), which expects a PyTorch Tensor.
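A common cause is accumulating the loss as a plain Python float (for example via total_loss += loss.item()) and then reducing it. Here is a minimal sketch of the fix, assuming total_loss comes from a loop like that and the process group is already initialized; the function name is just a placeholder:

```python
import torch
import torch.distributed as dist

def reduce_loss(total_loss):
    """Average a loss value across ranks; accepts a Tensor or a plain float."""
    # If the loss was accumulated as a Python float (e.g. total_loss += loss.item()),
    # wrap it in a Tensor first -- dist.all_reduce() only accepts Tensors.
    if not torch.is_tensor(total_loss):
        # Assumes each rank has called torch.cuda.set_device() so "cuda"
        # resolves to that rank's GPU.
        total_loss = torch.tensor(total_loss, device="cuda")
    dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
    return total_loss / dist.get_world_size()
```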
Some things to check:
- Make sure total_loss is a PyTorch Tensor before passing it to dist.all_reduce() (see the sketch above)
- Verify that the data types and shapes of all inputs to the model forward pass match what the model expects
- Print or log the types and shapes of inputs at various points to validate that they are Tensors (a helper sketch follows this list)
- Try training on a single GPU first to isolate whether it is a distributed training issue
- Ensure a recent, compatible version of PyTorch is installed; torch.distributed ships with PyTorch itself, so check that any other libraries in your stack match the installed PyTorch version
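For the logging suggestion above, a small helper like this can be dropped into the training loop; the names (log_input, batch) are placeholders for whatever your script uses:

```python
import torch

def log_input(name, obj, rank=0):
    """Print the type (and shape/dtype for Tensors) so a non-Tensor is easy to spot."""
    if torch.is_tensor(obj):
        print(f"[rank {rank}] {name}: Tensor shape={tuple(obj.shape)} "
              f"dtype={obj.dtype} device={obj.device}")
    else:
        print(f"[rank {rank}] {name}: {type(obj).__name__} -- NOT a Tensor")

# Example usage inside the loop (names are placeholders):
# log_input("batch", batch, dist.get_rank())
# log_input("total_loss", total_loss, dist.get_rank())
```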
The key is tracing back to find where a non-Tensor input is being passed and handling it appropriately. Adding more logging or printing can help identify the source, and simplifying to single-GPU training can determine whether the problem is specific to distributed training.
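If you want the same script to run both ways, one option (a sketch, not taken from your code) is to guard the collective call so it is skipped when no process group exists:

```python
import torch.distributed as dist

def maybe_all_reduce(tensor):
    """Run all_reduce only when a process group is initialized, so the same
    training loop works on a single GPU (no distributed setup) for debugging."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()
    return tensor
```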
Let me know if this helps point you in the right direction or if any other details would help troubleshoot further!