I am trying to train the GPT2-large model on SageMaker Studio using an 'ml.g4dn.2xlarge' instance. The training file is very small (13 KB). It gives the following error:
ExitCode 1
ErrorMessage "RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
    [[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
    [[XRTExecute_G15]]
(1) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs
Allocator (GPU_0_bfc) ran out of memory trying to allocate 23.91GiB (rounded to 25677513472)requested by op
*******************************************************************************************_________
Allocator (GPU_0_bfc) ran out of memory trying to allocate 25.00MiB (rounded to 26214400)requested by op
OP_REQUIRES failed at xrt_execute_op.cc:432 : RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
20%|██ | 1/5 [03:23<13:33, 203.35s/it]"
Command "/opt/conda/bin/python3.8 run_clm.py --do_train True --model_name_or_path gpt2-large --num_train_epochs 5 --output_dir /opt/ml/model --per_device_train_batch_size 10 --train_file /opt/ml/input/data/train/train.txt"
2023-03-30 03:44:11,044 sagemaker-training-toolkit ERROR Encountered exit_code 1
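For context on the numbers in the log, here is a rough back-of-the-envelope estimate of how much GPU memory the training state alone would need. It is only a sketch under stated assumptions: GPT2-large has roughly 774M parameters, training runs in full fp32 with an Adam-style optimizer (about 16 bytes per parameter for weights, gradients, and two optimizer moments), and the single T4 GPU in an ml.g4dn.2xlarge has a nominal 16 GB. Activations, which grow with batch size and sequence length, are not counted. The helper name training_state_bytes is made up for illustration.

```python
GIB = 2**30

# Assumption: ~774M parameters for gpt2-large (approximate published figure).
GPT2_LARGE_PARAMS = 774_000_000

# Assumption: nominal 16 GB on the T4, treated here as 16 GiB (an upper bound).
T4_MEMORY_BYTES = 16 * GIB


def training_state_bytes(num_params, bytes_per_param=16):
    """Estimate memory for weights + gradients + Adam moments in fp32.

    4 bytes (weights) + 4 bytes (gradients) + 8 bytes (two Adam moments)
    = 16 bytes per parameter. Activations are NOT included.
    """
    return num_params * bytes_per_param


state = training_state_bytes(GPT2_LARGE_PARAMS)
headroom = T4_MEMORY_BYTES - state

print(f"training state: {state / GIB:.2f} GiB")
print(f"left for activations and workspace: {headroom / GIB:.2f} GiB")
```

Under these assumptions most of the GPU is already taken before a single activation is stored, so a per-device batch size of 10 leaves very little room, regardless of how small the training file is.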
The training config is as follows:
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.g4dn.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyper_params,
    compiler_config=TrainingCompilerConfig(),
    environment={'GPU_NUM_DEVICES': '1'},
    disable_profiler=True,
    debugger_hook_config=False,
)
huggingface_estimator.fit({'train': s3_training_data}, wait=True)
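The hyper_params dictionary itself is not shown in the config. Judging only from the command line in the error log, it would be equivalent to something like the sketch below. The dictionary contents are reconstructed from that command, and to_cli_args is a hypothetical helper that merely mimics the argument order seen in the log; it is not part of the SageMaker SDK.

```python
# Reconstructed from the "Command" line in the log; not the original definition.
hyper_params = {
    "model_name_or_path": "gpt2-large",
    "do_train": True,
    "num_train_epochs": 5,
    "per_device_train_batch_size": 10,
    "output_dir": "/opt/ml/model",
    "train_file": "/opt/ml/input/data/train/train.txt",
}


def to_cli_args(params):
    """Render a dict as '--key value' pairs, sorted by key (hypothetical helper)."""
    parts = []
    for key in sorted(params):
        parts += [f"--{key}", str(params[key])]
    return " ".join(parts)


print(to_cli_args(hyper_params))
```

Of these, per_device_train_batch_size is the value most directly tied to activation memory, so it is the first one worth lowering when checking whether the out-of-memory error depends on batch size.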
Similar errors happen for any Hugging Face GPT model other than the basic (smallest) GPT2. I am using a fairly