Hi,
I am trying to deploy the Databricks open-source LLM, Dolly, on an inf2 instance (instance type inf2.24xlarge), using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 2023051 AMI.
I am able to compile the model, but when loading the compiled model I get the following error:
Unknown opcode for unpickling at 0x59: 89
Details:
I ran the following code for the compilation, which took 789.6257681846619 seconds and generated a *.pt file of 11.8 GB:
import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
# Create the tokenizer and model
name = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token # Define the padding token value
model = AutoModelForCausalLM.from_pretrained(name, torchscript=True)
model.eval()
text = "Explain to me the difference between nuclear fission and fusion."
# Encode the input text
token_encode_start_time = time.time()
encoded_input = tokenizer(text, return_tensors='pt')
token_encode_completion_time = time.time()
token_encoding_time = token_encode_completion_time - token_encode_start_time
print('Encode time:', token_encoding_time, 'seconds')
print(encoded_input)
neuron_input = (
    encoded_input['input_ids'],
    encoded_input['attention_mask']
)
# Inference on CPU
token_inference_start_time = time.time()
output_cpu = model(*neuron_input)
token_inference_end_time = time.time()
token_inference_time = token_inference_end_time - token_inference_start_time
print('Inference time:', token_inference_time, 'seconds')
#print(tokenizer.decode(output_cpu[0]))
print(output_cpu)
# Compilation code
model.eval()
print("evaluation done")
compile_start_time = time.time()
model_neuron = torch_neuronx.trace(model, neuron_input)
compile_end_time = time.time()
compile_time = compile_end_time - compile_start_time
print('compile_time:', compile_time, 'seconds')
# save compiled model
filename = "dolly_neuron.pt"
torch.jit.save(model_neuron, filename)
print(encoded_input)
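As a side note, a quick sanity check of the traced module against the CPU output, done before saving, could look like this (a sketch; the atol value is an arbitrary choice on my part, not a recommended tolerance):

with torch.no_grad():
    output_neuron = model_neuron(*neuron_input)
# With torchscript=True, the first element of the output tuple holds the logits.
print(torch.allclose(output_cpu[0], output_neuron[0], atol=1e-2))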
When I try to load the saved model for inference with:
neuron_model = torch.jit.load("dolly_neuron.pt")
I get the following error:
Unknown opcode for unpickling at 0x59: 89
Call stack:
/home/ubuntu/bert-example/huggingface-demos/inferentia2/dolly/dolly_test.py:41 in <module>

    40 # Load TorchScript back
❱   41 neuron_model = torch.jit.load("dolly_neuron_2.pt")
    43 encoded_input = tokenizer(text, return_tensors='pt')

/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/jit/_serialization.py:162 in load

    160     cu = torch._C.CompilationUnit()
    161     if isinstance(f, str) or isinstance(f, pathlib.Path):
❱   162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    163     else:
    164         cpp_module = torch._C.import_ir_module_from_buffer(
    165             cu, f.read(), map_location, _extra_files
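To help isolate the failure, one thing that could be tried is a minimal save/load round-trip with a tiny scripted module (a sketch). If this loads cleanly in the same venv, torch.jit serialization itself works, and the problem is likely specific to the large 11.8 GB Dolly artifact:

import torch

# Tiny round-trip to check torch.jit.save/load in this environment.
tiny = torch.jit.script(torch.nn.Linear(4, 4))
torch.jit.save(tiny, "tiny.pt")
reloaded = torch.jit.load("tiny.pt")
print(reloaded(torch.randn(1, 4)).shape)  # expected: torch.Size([1, 4])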
Any pointers would be really helpful.
Based on this documentation, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#aws-trainium-and-aws-inferentia2-neuroncore-v2, it was not clear that GPT-NeoX is not supported: the page says decoder-only architectures are a good fit for inference, but does not discuss support for specific model architectures. Thanks for the response; this is helpful, and it is good to know there are plans to support this.
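In the meantime, to check that I understand the intended path: I assume that once GPT-NeoX support lands, the flow would look roughly like the existing GPT-J interface in transformers-neuronx. The module path, class name, and tp_degree below are my guesses based on that interface, not a confirmed API:

import torch
# Assumed module path and class name, modeled on the GPT-J interface.
from transformers_neuronx.gptneox.model import GPTNeoXForSampling

# tp_degree=8 is a guess at a reasonable tensor-parallel degree for inf2.24xlarge.
neuron_model = GPTNeoXForSampling.from_pretrained(
    "databricks/dolly-v2-7b", batch_size=1, tp_degree=8, amp='f16'
)
neuron_model.to_neuron()  # compile and load onto the NeuronCores

# Reusing tokenizer and encoded_input from the compilation script above.
with torch.inference_mode():
    generated = neuron_model.sample(encoded_input['input_ids'], sequence_length=256)
print(tokenizer.decode(generated[0]))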
I have a couple of follow-up questions: