Error while compiling and running an LLM on an Inf2 instance


Hi, I am trying to deploy Databricks' open-source LLM, Dolly, on an Inf2 instance. The instance type is inf2.24xlarge, using the AMI Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 2023051. I am able to compile the model, but while loading the compiled model I get the following error: Unknown opcode for unpickling at 0x59: 89

Details: I ran the following code for compilation, which took 789.6257681846619 seconds and generated an 11.8 GB *.pt file:

import time

import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForCausalLM

# Create the tokenizer and model
name = "databricks/dolly-v2-7b"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token  # define the padding token value
model = AutoModelForCausalLM.from_pretrained(name, torchscript=True)
model.eval()

text = "Explain to me the difference between nuclear fission and fusion."

# Tokenize the prompt
token_encode_start_time = time.time()
encoded_input = tokenizer(text, return_tensors='pt')
token_encoding_time = time.time() - token_encode_start_time
print('Encode time:', token_encoding_time, 'seconds')
print(encoded_input)

# Positional inputs expected by torch_neuronx.trace
neuron_input = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Inference on CPU as a sanity check
token_inference_start_time = time.time()
output_cpu = model(*neuron_input)
token_inference_time = time.time() - token_inference_start_time
print('Inference time:', token_inference_time, 'seconds')
print(output_cpu)

# Compile the model for Neuron
compile_start_time = time.time()
model_neuron = torch_neuronx.trace(model, neuron_input)
compile_time = time.time() - compile_start_time
print('compile_time:', compile_time, 'seconds')

# Save the compiled model
filename = "dolly_neuron.pt"
torch.jit.save(model_neuron, filename)

When I try to load the saved model for inference with:

neuron_model = torch.jit.load("dolly_neuron.pt")

I get the following error:

Unknown opcode for unpickling at 0x59: 89

Call stack:

/home/ubuntu/bert-example/huggingface-demos/inferentia2/dolly/dolly_test.py:41 in <module>

     38
     39
     40 # Load TorchScript back
  ❱  41 neuron_model = torch.jit.load("dolly_neuron_2.pt")
     42
     43 encoded_input = tokenizer(text, return_tensors='pt')
     44

/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/jit/_serialization.py:162 in load

    159
    160     cu = torch._C.CompilationUnit()
    161     if isinstance(f, str) or isinstance(f, pathlib.Path):
  ❱ 162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    163     else:
    164         cpp_module = torch._C.import_ir_module_from_buffer(
    165             cu, f.read(), map_location, _extra_files

Any pointers will be really helpful.

2 Answers
Accepted Answer

There is a known issue with this model type that the team is working on, and a fix will ship in an upcoming release of the Neuron SDK. Since this is a large model (7B parameters), you will need to shard it across multiple NeuronCores, and we generally recommend the transformers-neuronx library, which provides tensor parallelism and other features to help deploy decoder-based LLMs to Inf2-based instances. This library will soon also support GPT-NeoX, the underlying architecture of this model. Keep an eye out for new announcements on the Neuron SDK docs release notes page here.
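For orientation, here is a minimal sketch of the transformers-neuronx flow, modeled on the library's public GPT-2 sampling example (GPT-NeoX was not yet supported when this answer was written, so GPT2ForSampling stands in; the split-checkpoint directory, tp_degree, amp, and sequence_length values are illustrative assumptions, not values from this thread):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.module import save_pretrained_split

# Save the Hugging Face checkpoint in the split layout that
# transformers-neuronx loads from (path is an arbitrary example).
cpu_model = AutoModelForCausalLM.from_pretrained("gpt2")
save_pretrained_split(cpu_model, "./gpt2-split")

# Shard the model across NeuronCores with tensor parallelism
# (tp_degree=2 is an assumed value; choose it for your instance size).
neuron_model = GPT2ForSampling.from_pretrained(
    "./gpt2-split", batch_size=1, tp_degree=2, amp="f16"
)
neuron_model.to_neuron()  # compile and load onto the NeuronCores

# Autoregressive sampling on the Neuron devices
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Explain nuclear fission vs. fusion.", return_tensors="pt").input_ids
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0]))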

AWS
EXPERT
answered a year ago
  • Based on this documentation https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#aws-trainium-and-aws-inferentia2-neuroncore-v2, it was not clear that GPT-NeoX is unsupported, since it describes decoder-only architectures as a good fit for inference and does not discuss model-specific architectures. Thanks for the response; this is helpful, and it is good to know there are plans to support this.

  • I have a couple of follow-up questions:

    1. Does the user need to wait for every new model/architecture until support lands in the transformers-neuronx SDK? Say tomorrow I want to switch from Dolly to another LLM; as a user, can I compile the model and start using it the same way as in the torch-neuronx examples?
    2. In transformers-neuronx, after converting the model to a Neuron model with model_neuron.to_neuron(), can we store and load it using torch.jit.save() and torch.jit.load()?

To your follow-up questions:

  1. transformers-neuronx is a solution for decoder inference only. It takes a model-zoo approach, which means either our team adds support for a new model type, or customers can freely fork it or submit a PR for a new model architecture, since it is open source. We will also soon release a library that lets customers shard model weights using tensor parallelism and pipeline parallelism. Keep an eye on the Neuron SDK release notes in the coming months.
  2. We don't use PyTorch at all in transformers-neuronx, so torch.jit.save() and torch.jit.load() won't work. You can see examples of serialization here (and in the sketch after this list): https://github.com/aws-neuron/transformers-neuronx#serialization-support
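As a rough, non-authoritative sketch of what that serialization flow looks like: the method names (_save_compiled_artifacts / _load_compiled_artifacts), the GPT2ForSampling class, and the directory paths below are assumptions based on the README linked above and may differ between transformers-neuronx releases, so treat the linked documentation as authoritative.

from transformers_neuronx.gpt2.model import GPT2ForSampling

# First run: compile the model, then persist the compiled artifacts.
# NOTE: the _save_compiled_artifacts name is an assumption taken from the
# README linked above; verify it against your installed version.
neuron_model = GPT2ForSampling.from_pretrained("./gpt2-split", tp_degree=2, amp="f16")
neuron_model.to_neuron()
neuron_model._save_compiled_artifacts("./gpt2-compiled-artifacts")

# Later run: reload the saved artifacts to skip recompilation.
neuron_model = GPT2ForSampling.from_pretrained("./gpt2-split", tp_degree=2, amp="f16")
neuron_model._load_compiled_artifacts("./gpt2-compiled-artifacts")
neuron_model.to_neuron()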
AWS
EXPERT
answered a year ago
