Trying to train GPT2-large and running out of memory


I am trying to train the GPT2-large model on SageMaker Studio, using an ml.g4dn.2xlarge instance. The training file is very small (13 KB). Training fails with the following error:

ExitCode 1
ErrorMessage "RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
    [[{{node XRTExecute}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
    [[XRTExecute_G15]]
(1) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
0 successful operations. 0 derived errors ignored.

Recent warning and error logs:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 23.91GiB (rounded to 25677513472) requested by op
Allocator (GPU_0_bfc) ran out of memory trying to allocate 25.00MiB (rounded to 26214400) requested by op
OP_REQUIRES failed at xrt_execute_op.cc:432 : RESOURCE_EXHAUSTED: Out of memory while trying to allocate 26214400 bytes.
20%|██ | 1/5 [03:23<13:33, 203.35s/it]"

Command "/opt/conda/bin/python3.8 run_clm.py --do_train True --model_name_or_path gpt2-large --num_train_epochs 5 --output_dir /opt/ml/model --per_device_train_batch_size 10 --train_file /opt/ml/input/data/train/train.txt"

2023-03-30 03:44:11,044 sagemaker-training-toolkit ERROR Encountered exit_code 1
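For reference, the ml.g4dn.2xlarge has a single NVIDIA T4 GPU with 16 GB of memory, so the 23.91 GiB allocation in the log above could never succeed. A quick sanity check of what the GPU actually exposes, assuming PyTorch is available on the instance, is:

import torch

# Print total and currently allocated memory on the first GPU.
# On a g4dn.2xlarge (one NVIDIA T4) the total is about 16 GB,
# well below the 23.91 GiB allocation the log shows failing.
props = torch.cuda.get_device_properties(0)
print(f"Device: {props.name}")
print(f"Total memory: {props.total_memory / 1024**3:.2f} GiB")
print(f"Allocated:    {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")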

The training config is as follows:

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.g4dn.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyper_params,
    compiler_config=TrainingCompilerConfig(),
    environment={'GPU_NUM_DEVICES': '1'},
    disable_profiler=True,
    debugger_hook_config=False,
)
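The hyper_params dict itself is not shown here; reconstructed from the run_clm.py command line in the error log above, it is presumably something like:

# Reconstructed from the command line in the error log;
# not copied from the notebook.
hyper_params = {
    'do_train': True,
    'model_name_or_path': 'gpt2-large',
    'num_train_epochs': 5,
    'output_dir': '/opt/ml/model',
    'per_device_train_batch_size': 10,
    'train_file': '/opt/ml/input/data/train/train.txt',
}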

huggingface_estimator.fit({'train': s3_training_data}, wait=True)

Similar errors happen for any Hugging Face GPT model other than the smallest GPT-2.

Asked a year ago · 512 views
1 Answer

You might be hitting a GPU memory issue: the g4dn.2xlarge has a single NVIDIA T4 GPU with 16 GB of memory, while training GPT2-large (~774M parameters) in fp32 with Adam needs roughly 12 GB for weights, gradients, and optimizer state alone, before any activations. It would be a good idea to try a g5.2xlarge or p3.2xlarge instance. I also suggest looking at this example: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple.ipynb
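If a larger instance isn't an option, a minimal sketch of memory-reducing changes to the hyperparameters might look like the following. It assumes the hyper_params dict shown above; all of these are standard Hugging Face TrainingArguments flags accepted by run_clm.py, though I have not verified them against the Training Compiler setup in the post:

# A minimal sketch: keep the effective batch size at 10 while cutting
# per-step activation memory, and halve most tensors with fp16.
hyper_params = {
    'do_train': True,
    'model_name_or_path': 'gpt2-large',
    'num_train_epochs': 5,
    'output_dir': '/opt/ml/model',
    'train_file': '/opt/ml/input/data/train/train.txt',
    'per_device_train_batch_size': 1,   # was 10
    'gradient_accumulation_steps': 10,  # 1 x 10 = same effective batch size
    'fp16': True,                       # halve activation/gradient memory
    'gradient_checkpointing': True,     # recompute activations instead of storing them
}

Keeping gradient_accumulation_steps × per_device_train_batch_size equal to the original batch size of 10 preserves the effective batch while cutting per-step activation memory roughly tenfold.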

AWS
EXPERT
answered a year ago
