The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.
However, the actual command that was run is:
mpirun --host algo-1 -np 1 ... ...
This launches only one process on one GPU device, which cannot host most GPT models in only 16 GB of GPU memory.
Have you modified any parameters related to -np? What is the value of processes_per_host before the smp_estimator.fit cell?
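For reference, processes_per_host is set in the estimator's distribution dictionary, and the MPI launcher derives its -np value from it (processes_per_host x instance_count). A minimal sketch of what that configuration usually looks like — the script name, instance type, and partition count below are illustrative assumptions, not values from your notebook:

```python
# Sketch of a SageMaker estimator configured for the model-parallel
# library. Hyperparameter values here are assumptions for illustration.
from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
    entry_point="train_gpt.py",        # hypothetical training script
    role="<execution-role-arn>",
    instance_type="ml.p3.16xlarge",    # 8 x V100, 16 GB memory each
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": 8,   # split the model across 8 GPUs
                },
            }
        },
        # processes_per_host controls the -np value passed to mpirun:
        # 8 processes on one host -> mpirun -np 8, one process per GPU.
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
        },
    },
)
# smp_estimator.fit(...)
```

If processes_per_host is 1 (or the "mpi" block is missing), mpirun will launch with -np 1 exactly as in your log, and the model will not be partitioned across GPUs.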
answered 2 years ago
Great catch, I missed that. I'll run it through again to see if that fixes the issue. I did eventually get it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because it threw an error as well. I'll report back on what I find.