Trying to train GPT2-large and running out of memory


I am trying to train the GPT2-large model in SageMaker Studio using an ml.g4dn.2xlarge instance. The training file is very small (13 KB). It gives the following error:

ExitCode 1 ErrorMessage "RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0 2 root error(s) found. (0) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes. #011 [[{{node XRTExecute}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

#011 [[XRTExecute_G15]] (1) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes. 0 successful operations. 0 derived errors ignored. Recent warning and error logs Allocator (GPU_0_bfc) ran out of memory trying to allocate 23.91GiB (rounded to 25677513472)requested by op *******************************************************************************************_________ Allocator (GPU_0_bfc) ran out of memory trying to allocate 25.00MiB (rounded to 26214400)requested by op

OP_REQUIRES failed at : RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes. 20%|██ | 1/5 [03:23<13:33, 203.35s/it]" Command "/opt/conda/bin/python3.8 --do_train True --model_name_or_path gpt2-large --num_train_epochs 5 --output_dir /opt/ml/model --per_device_train_batch_size 10 --train_file /opt/ml/input/data/train/train.txt" 2023-03-30 03:44:11,044 sagemaker-training-toolkit ERROR Encountered exit_code 1

The training config is as follows:

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

huggingface_estimator = HuggingFace(
    entry_point='',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.g4dn.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyper_params,
    compiler_config=TrainingCompilerConfig(),
    environment={'GPU_NUM_DEVICES': '1'},
    disable_profiler=True,
    debugger_hook_config=False,
)
huggingface_estimator.fit({'train': s3_training_data}, wait=True)

Similar errors happen for any Hugging Face GPT model other than the basic (smallest) GPT2. I am using a fairly

asked a year ago, 500 views
1 Answer

You might be hitting a GPU memory limit. It would be a good idea to try g5.2xlarge or p3.2xlarge. Also, I suggest looking at this example -
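Separately from that example, a rough back-of-the-envelope estimate can show why memory is tight here. The sketch below is illustrative only and rests on assumptions that are not in the original post: about 774M parameters for gpt2-large, fp32 weights and gradients, an Adam-style optimizer holding two fp32 states per parameter, and a 16 GB T4 GPU on ml.g4dn.2xlarge. It ignores activations, which grow with batch size and sequence length. The second half is a hypothetical lower-memory variant of the hyperparameters visible in the failing command; block_size and gradient_accumulation_steps are options accepted by the run_clm.py example script, but whether these values actually fit on a given instance is untested.

```python
# Rough estimate of GPU memory held by model and optimizer state alone.
# All constants below are assumptions for illustration, not measurements.

GIB = 1024 ** 3


def training_state_gib(num_params: int, bytes_per_value: int = 4) -> float:
    """Weights + gradients + two Adam-style optimizer states, in GiB."""
    weights = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    optimizer_states = 2 * num_params * bytes_per_value
    return (weights + grads + optimizer_states) / GIB


GPT2_LARGE_PARAMS = 774_000_000  # approximate parameter count (assumption)
T4_MEMORY_GB = 16                # single T4 on ml.g4dn.2xlarge (assumption)

state_gib = training_state_gib(GPT2_LARGE_PARAMS)
print(f"model + optimizer state: about {state_gib:.1f} GiB "
      f"on a {T4_MEMORY_GB} GB GPU, before any activations")

# Hypothetical lower-memory hyperparameters. The first five entries mirror
# the failing command; the last three are the illustrative changes.
hyper_params = {
    "do_train": True,
    "model_name_or_path": "gpt2-large",
    "num_train_epochs": 5,
    "output_dir": "/opt/ml/model",
    "train_file": "/opt/ml/input/data/train/train.txt",
    "per_device_train_batch_size": 1,   # was 10 in the failing run
    "gradient_accumulation_steps": 10,  # keeps the effective batch size at 10
    "block_size": 512,                  # shorter sequences, smaller activations
}
```

Under these assumptions most of the GPU is consumed before a single batch is processed, so lowering the per-device batch size or sequence length, or moving to a GPU with more memory, are the usual levers to try.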

AWS
answered a year ago
