You can find the notebook in SageMaker Studio under Home -> JumpStart -> Falcon 7B Instruct BF16 -> Notebook.
I did not change anything in the notebook. When the training starts, it errors out for me.
CloudWatch:
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1769] 2023-08-15 22:24:01,348 >> ***** Running training *****
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1770] 2023-08-15 22:24:01,348 >> Num examples = 1,054
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1771] 2023-08-15 22:24:01,348 >> Num Epochs = 1
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1772] 2023-08-15 22:24:01,348 >> Instantaneous batch size per device = 2
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1773] 2023-08-15 22:24:01,348 >> Total train batch size (w. parallel, distributed & accumulation) = 16
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1774] 2023-08-15 22:24:01,348 >> Gradient Accumulation steps = 2
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1775] 2023-08-15 22:24:01,348 >> Total optimization steps = 66
2023-08-15T18:24:01.356-04:00 [INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
2023-08-15T18:24:02.357-04:00 0%| | 0/66 [00:00<?, ?it/s]
2023-08-15T18:24:07.358-04:00 ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
Training job in SageMaker:
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "
│ 154 │ │ │ raise RuntimeError( │
│ 155 │ │ │ │ "none of output has requires_grad=True," │
│ 156 │ │ │ │ " this checkpoint() is not necessary") │
│ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │
│ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else N │
│ 159 │ │ │ │ │ for inp in detached_inputs) │
│ 160 │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in │
│ backward │
│ 197 │ # The reason we repeat same the comment below is that
, exit code: 1
The above is the output of section 3.3 in the notebook, but section 2.3 fails with the same error. As a workaround for step 2.3, I can train the model manually by going to SageMaker Studio -> Falcon 7B Instruct BF16 -> Train tab. However, I have no such workaround for step 3.3, which still results in the above error. I also tried changing the training parameters, without success.
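For reference, the numbers in the CloudWatch log above appear to be self-consistent, so the configuration itself looks sane up to the point of the crash. The sketch below is only a sanity check under stated assumptions: the device count is not printed in the log, so it is inferred from the total batch size, and the step formula is my understanding of how the numbers line up rather than a confirmed description of the training script.

```python
import math

# Values copied from the CloudWatch log
num_examples = 1054      # "Num examples"
per_device_batch = 2     # "Instantaneous batch size per device"
grad_accum_steps = 2     # "Gradient Accumulation steps"
total_batch = 16         # "Total train batch size (w. parallel, distributed & accumulation)"
num_epochs = 1           # "Num Epochs"

# Assumption: the number of devices is not logged; it is inferred here
# from total batch = per-device batch * accumulation steps * devices.
num_devices = total_batch // (per_device_batch * grad_accum_steps)

# Batches seen per epoch across all devices, then optimizer steps after accumulation
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
optimizer_steps = (batches_per_epoch // grad_accum_steps) * num_epochs

print(num_devices, batches_per_epoch, optimizer_steps)
```

Under these assumptions the result matches the logged "Total optimization steps = 66", which suggests the failure happens in the first backward pass rather than in the data or batch setup.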