Error while compiling and running an LLM on an Inf2 instance


Hi, I am trying to deploy the Databricks open source LLM, Dolly, on an Inf2 instance. The instance type is inf2.24xlarge, using the AMI Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 2023051. I am able to compile the model, but while loading the compiled model I get the following error: Unknown opcode for unpickling at 0x59: 89

Details: I ran the following code for compilation, which took 789.6257681846619 seconds and generated a *.pt file of 11.8 GB:

import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
import transformers
from tqdm import tqdm as tqdm
from transformers import pipeline
import time
  
# Create the tokenizer and model
name = "databricks/dolly-v2-7b"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token # Define the padding token value
model = AutoModelForCausalLM.from_pretrained(name, torchscript=True)
model.eval()
text = "Explain to me the difference between nuclear fission and fusion."

# Encode the input text
token_encode_start_time = time.time()
encoded_input = tokenizer(text, return_tensors='pt')
token_encode_completion_time = time.time()
token_encoding_time = token_encode_completion_time - token_encode_start_time
print('Encode time:', token_encoding_time, 'seconds')
print(encoded_input)

neuron_input = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Inference on CPU (sanity check before compiling for Neuron)
token_inference_start_time = time.time()
output_cpu = model(*neuron_input)
token_inference_end_time = time.time()
token_inference_time = token_inference_end_time - token_inference_start_time
print('Inference time:', token_inference_time, 'seconds')
#print(tokenizer.decode(output_cpu[0]))
print(output_cpu)

# Compilation code
model.eval()
print("evaluation done")
compile_start_time = time.time()
model_neuron = torch_neuronx.trace(model, neuron_input)
compile_end_time = time.time()
compile_time = compile_end_time - compile_start_time
print('compile_time:', compile_time, 'seconds')


# save compiled model
filename = "dolly_neuron.pt"
torch.jit.save(model_neuron, filename)
print(encoded_input)

When I try to load the saved model for inference using:

neuron_model = torch.jit.load("dolly_neuron.pt")

I get the following error:

Unknown opcode for unpickling at 0x59: 89

Call stack:

/home/ubuntu/bert-example/huggingface-demos/inferentia2/dolly/dolly_test.py:41 in <module>

    38
    39
    40 # Load TorchScript back
❱   41 neuron_model = torch.jit.load("dolly_neuron_2.pt")
    42
    43 encoded_input = tokenizer(text, return_tensors='pt')
    44

/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/jit/_serialization.py:162 in load

   159
   160     cu = torch._C.CompilationUnit()
   161     if isinstance(f, str) or isinstance(f, pathlib.Path):
❱  162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
   163     else:
   164         cpp_module = torch._C.import_ir_module_from_buffer(
   165             cu, f.read(), map_location, _extra_files

Any pointers will be really helpful.

2 Answers
Accepted Answer

There is a known issue with this model type that the team is working on, and a fix will be included in an upcoming release of the Neuron SDK. Since this is a large model (7B parameters) you will need to shard it across multiple NeuronCores, and we generally recommend the transformers-neuronx library, which provides Tensor Parallelism and other features to help deploy decoder-based LLMs on Inf2-based instances. Soon this library will also support GPT-NeoX, which is the underlying architecture for this model. Keep an eye on the Neuron SDK documentation's release notes page for new announcements.
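For reference, here is a minimal sketch of the typical transformers-neuronx workflow, adapted from the library's GPT-2 sampling example. It is an illustration only: class names and keyword arguments such as tp_degree may differ between SDK versions, and a GPT-NeoX/Dolly class is not shown because that support had not yet been released.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.module import save_pretrained_split

# Save the CPU checkpoint in the split format that transformers-neuronx can shard
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')

# Shard the weights across NeuronCores with tensor parallelism and compile
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2,
                                               n_positions=256, amp='f32', unroll=None)
model_neuron.to_neuron()  # compile and load onto the NeuronCores

# Run autoregressive sampling on the device
tokenizer = AutoTokenizer.from_pretrained('gpt2')
encoded_input = tokenizer("Hello, I'm a language model,", return_tensors='pt')
with torch.inference_mode():
    generated = model_neuron.sample(encoded_input.input_ids, sequence_length=256)
print(tokenizer.decode(generated[0]))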

AWS
EXPERT
answered a year ago
  • Based on this documentation, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#aws-trainium-and-aws-inferentia2-neuroncore-v2, it was not clear that GPT-NeoX is not supported, since it says decoder-only architectures are a good fit for inference and does not discuss model-specific support. Thanks for the response; this is helpful, and it is good to know there are plans to support this.

  • I have a couple of follow-up questions :

    1. Does the user need to wait for support for every new model/architecture to land in the transformers-neuronx SDK? Say tomorrow I want to switch from Dolly to another LLM; as a user, can I compile the model and start using it in the same way as the torch-neuronx examples?
    2. In transformers-neuronx, when we convert the model to a Neuron model using model_neuron.to_neuron(), can we store and load it using torch.jit.save() and torch.jit.load()?

To your follow-up questions:

  1. transformers-neuronx is a solution for decoder inference only. It takes a model zoo approach, which means either our team adds support for a new model type, or customers can freely fork the project or submit a PR for a new model architecture, since it is open source. We will also release a library soon that lets customers shard model weights using Tensor Parallelism and Pipeline Parallelism. Keep an eye on the Neuron SDK release notes over the coming months.
  2. We don't use PyTorch at all in transformers-neuronx, so torch.jit.save() and torch.jit.load() won't work. You can see serialization examples here: https://github.com/aws-neuron/transformers-neuronx#serialization-support (a rough sketch of that pattern follows below).
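As a minimal sketch of what persisting and reloading looks like without TorchScript, assuming the same GPT-2 example as above (the exact API for caching compiled artifacts is described in the linked README and may differ between versions):

from transformers import AutoModelForCausalLM
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.module import save_pretrained_split

# One-time step: persist the checkpoint in the split format transformers-neuronx expects
save_pretrained_split(AutoModelForCausalLM.from_pretrained('gpt2'), 'gpt2-split')

# At serving time: reload the split checkpoint and compile/load onto the NeuronCores,
# instead of round-tripping a TorchScript artifact with torch.jit.save()/torch.jit.load()
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2)
model_neuron.to_neuron()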
AWS
EXPERT
answered a year ago
