gptj_demo compilation failed on Inf2


Hi, I'm trying to run gptj_demo on Inf2 using the AMI "Deep Learning AMI Neuron PyTorch 1.13.0 (Ubuntu 20.04) 20230405". I installed PyTorch Neuron following https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html#pytorch-neuronx-install.

While running:

(aws_neuron_venv_pytorch) ubuntu@ip-172-31-32-224:~$ gptj_demo run gpt-j-6B-split

I got the following exception:

2023-04-26T01:31:26Z ERROR 41469 [WalrusDriver]: Walrus pass: birverifier failed!
2023-04-26T01:31:26Z ERROR 41469 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB
Instruction: I-26932
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,4],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3947__mhlo.reshape_32_pftranspose_12031_set}@PSUM
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: ***************************************************************
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: An Internal Compiler Error has occurred
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: ***************************************************************
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]:
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Error message: Walrus driver failed to complete
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]:
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Error class: AssertionError
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Error location: Unknown
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Command line: /opt/aws_neuron_venv_pytorch/bin/neuronx-cc compile --framework=XLA --target=trn1 /tmp/tmpd5jgl51u/hlo_module.pb --output=/tmp/tmpd5jgl51u/hlo_module.pb.neff --verbose=35

2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Version information:
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: NeuronX Compiler version 2.5.0.28+1be23f232
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]:
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: HWM version 2.5.0.0-dad732dd6
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: NEFF version Dynamic
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: TVM not available
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: NumPy version 1.21.6
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: MXNet not available
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]:
2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Artifacts stored in: /tmp/tmpd5jgl51u/neuronxcc-ng2z05sr

I'm not sure whether this is related to --target=trn1, which seems to be hard-coded in: https://github.com/aws-neuron/transformers-neuronx/blob/1e72ddc31976925ba0c79e2ff12301ff3bd6b920/src/transformers_neuronx/compiler.py#L59
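For anyone who wants to check which flags the compiler was actually invoked with, here is a rough sketch that pulls them out of a pasted neuronx-cc log. It assumes every log record starts with a timestamp like 2023-04-26T01:31:26Z and that the invocation appears after the text "Command line: ", as in the log above. The helper names split_records and compiler_flags are made up for this example and are not part of any Neuron tool.

```python
import re
import shlex
from typing import Dict, List

# Assumption: each record begins with an ISO-8601 UTC timestamp followed by a space.
RECORD_START = re.compile(r"(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z )")


def split_records(raw: str) -> List[str]:
    """Split a log pasted as one long string into individual records."""
    return [part.strip() for part in RECORD_START.split(raw) if part.strip()]


def compiler_flags(raw: str) -> Dict[str, str]:
    """Return the --name=value flags from the first 'Command line:' record."""
    for record in split_records(raw):
        _, sep, command = record.partition("Command line: ")
        if not sep:
            continue  # this record does not contain the compiler invocation
        flags = {}
        for token in shlex.split(command):
            if token.startswith("--") and "=" in token:
                name, _, value = token[2:].partition("=")
                flags[name] = value
        return flags
    return {}


if __name__ == "__main__":
    sample = (
        "2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Error class: AssertionError "
        "2023-04-26T01:31:26Z ERROR 41469 [neuronx-cc]: Command line: "
        "/opt/aws_neuron_venv_pytorch/bin/neuronx-cc compile --framework=XLA "
        "--target=trn1 /tmp/tmpd5jgl51u/hlo_module.pb "
        "--output=/tmp/tmpd5jgl51u/hlo_module.pb.neff --verbose=35"
    )
    print(compiler_flags(sample))
```

This only reports what was logged; it does not tell you whether the target value is the cause of the failure.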

Thanks.

AWS
asked a year ago, 453 views
1 Answer

We are unable to reproduce the issue using the packages in the same AMI you used, with the latest transformers-neuronx and the latest transformers installed using "!pip install git+https://github.com/aws-neuron/transformers-neuronx.git transformers -U". If you are still having issues, please share the steps to reproduce the problem, including your setup commands and the script you ran, in a GitHub issue via https://github.com/aws-neuron/aws-neuron-sdk.
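When filing the issue, it helps to include the installed package versions from the environment where the failure occurs. The following is a minimal sketch for collecting them; the package list is an assumption about which distributions are relevant here, so adjust it to your environment. It uses only importlib.metadata from the Python standard library (Python 3.8 or later), and collect_versions is a name chosen for this example.

```python
from importlib import metadata

# Assumed list of relevant distributions; edit to match your setup.
PACKAGES = (
    "torch",
    "torch-neuronx",
    "neuronx-cc",
    "transformers-neuronx",
    "transformers",
)


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print in a pip-style format that can be pasted into the issue.
    for name, version in collect_versions().items():
        print(f"{name}=={version}")
```

Run it inside the same virtual environment (for example aws_neuron_venv_pytorch) that you used to run gptj_demo, and paste the output together with the full compiler log.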

answered a year ago
