Difference in PyTorch BERT model output vs Neuron


I converted a PyTorch BERT model to Neuron. However, the output embedding, a tensor of 1024 values, differs from the original: the sizes are the same, but each entry differs by roughly 1-5% from the original PyTorch model's output. This is the code I use to compile the PyTorch model for Neuron:

from transformers import BertTokenizer, BertModel
import torch
import torch_neuron  # provides torch_neuron.trace for Inferentia (inf1) compilation

# modelname is not defined in this snippet; it is the tokenizer's model name or path.
tokenizer = BertTokenizer.from_pretrained(modelname, model_max_length=512)
input_str = "The patient's ability is determined based on patients medical parameters, patients history of ability to attend a remote clinician sessions, and physical parameters. Based on identified parameters a patients profile score is calculated to determine patients ability to attend the remote clinician session."
# Pad to the 512-token maximum so the traced model sees a fixed input shape.
inputs = tokenizer(input_str, padding='max_length', return_tensors="pt")
PATH = './ptmodel/'
fname = 'modelneuron.pt'
# Flags passed through to the Neuron compiler.
kwargs = {'compiler_args': ['--fast-math', 'none', '--neuroncore-pipeline-cores', '1']}
# return_dict=False makes the model return plain tuples instead of a dict-like output.
model = BertModel.from_pretrained(PATH, local_files_only=True, return_dict=False)
neuron_model = torch_neuron.trace(
    model,
    example_inputs=(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids']),
    **kwargs,
)
neuron_model.save(fname)

Has anyone faced this issue, or does anyone know how to solve it?

Thanks, Ajay

asked 2 years ago, 544 views
2 Answers

The "--fast-math=none" option you are using actually casts matrix multiply operations to fp16 internally, and it may already be the best option for you. Some precision loss is expected because of the lower-precision data type. However, over a dataset such as MRPC, we see the same BERT accuracy as on GPU/CPU. If you see otherwise, please file a ticket via https://github.com/aws-neuron/aws-neuron-sdk/issues or send an email to aws-neuron-support@amazon.com.

answered 2 years ago
  • I'm getting reasonable outputs with "--fast-math=none", but without it (that is, with the default), all values in my embedding tensor are NaN. I want to use the default so I can get higher throughput at the cost of some precision, but with all NaNs the output is unusable. How do I solve this?

  • You should see a noticeable speedup when using either --fast-math=none or the default flags. The --fast-math=none flag disables some optimizations that can impact floating point precision, but the model will still run on the accelerator. If this doesn’t meet your performance requirements let us know the metrics you are observing and we can see if there is more we can do.

    Secondly, there was a known issue that could cause NaN values on some models with transformers>=4.20 (see: https://github.com/aws-neuron/aws-neuron-sdk/issues/474). This should be resolved as of the 2.5.0 Neuron release: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/rn.html#neuron-2-5-0-11-23-2022

    With the latest release of Neuron, you may be able to get the full performance benefit of the default flags while still getting accurate results. This will be model/weight dependent.

  • Hi Jonathan, --fast-math=none does work, but I would appreciate the greater speedup that the default would provide. To use the default, which currently gives all NaNs in my embedding, do I have to compile with Neuron 2.5, or is it just the runtime? I installed neuronx-dkms v2.6 and neuronx-tools v2.6 on my inf1 instance, but there is no change: I am still getting all NaNs. I mostly compile my Inferentia Neuron models on my laptop, or sometimes on a c5.12xlarge if needed. Sorry, I'm still new to all of this, so my queries may seem silly at times. :) Wishing you a Happy New Year.
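Since the thread turns on which versions are installed on the machine that compiles the model (a laptop or c5 instance) versus the inf1 instance that runs it, a small check like the one below may help when reporting the problem. It is only a sketch: it assumes the relevant pip distributions are named torch-neuron, neuron-cc, and transformers, and it reports what is installed without saying which versions contain the fix.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust to match your environment.
report = installed_versions(["torch-neuron", "neuron-cc", "transformers"])
for name, version in report.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Running it on both the compile machine and the inf1 instance shows whether the two environments actually match.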


Sample-to-sample variation is expected, since the CPU architecture differs from Inferentia (and from GPU), and the order of summation can lead to slightly different results. Would you be able to measure the accuracy over the evaluation data set for both CPU and Inferentia (and GPU as well, if it is available)?
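To illustrate the point about summation order, here is a minimal sketch in plain Python (double precision, with hand-picked values rather than anything from the model). Floating-point addition is not associative, so adding the same numbers in a different order can give a different result. The effect is typically larger in lower-precision formats.

```python
def naive_sum(values):
    """Add values strictly left to right, as a simple accumulator would."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers in two different orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]

sum_a = naive_sum(order_a)  # the first 1.0 is lost when added to 1e16
sum_b = naive_sum(order_b)  # the large terms cancel first, so both 1.0s survive

print(sum_a, sum_b)
print("same result:", sum_a == sum_b)
```

An explicit loop is used instead of the built-in sum() because newer Python versions apply compensated summation to floats, which would hide the effect.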

answered 2 years ago
AWS EXPERT, reviewed a year ago
  • Yes, I found some fluctuation between the CPU and GPU numbers as well, but it is 100 to 1000 times smaller than the difference with Neuron on Inferentia. Is that expected?

    These are the first few numbers from GPU: [-1.0692453384399414, -1.4999507665634155, 1.6326944828033447, -0.13731196522712708, -0.8026626110076904, -0.48562130331993103, -0.21466472744941711, 0.44606760144233704,....

    These are the first few numbers from CPU: [-1.0692460536956787, -1.4999487400054932, 1.6326937675476074, -0.13731253147125244, -0.8026641607284546, -0.48562222719192505, -0.21466375887393951, 0.4460683763027191,....

    These are the numbers from Inferentia: [-1.0766984224319458, -1.4989659786224365, 1.6356642246246338, -0.13928218185901642, -0.8090097904205322, -0.4883664846420288, -0.2172311544418335, 0.4422350823879242,......

    The CPU and GPU numbers start to differ at about the sixth digit after the decimal point, but the Inferentia numbers start to differ at the second digit.

  • Yes, this is approximately in the range that we expect.

    The fundamental difference in the inf1 Neuron hardware is that all of the matrix multiplication-like operations will be performed in BF16 by default. See the mixed precision guide for more information: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuron-cc/mixed-precision.html

    It is sometimes possible to achieve better precision with FP16, depending on the model weights and operations. The highest-precision FP16-tuned configuration can be achieved using the following flags:

    --fast-math fp32-cast-matmult-fp16 no-fast-relayout
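To put the numbers posted in this thread in context, the sketch below computes the relative differences for the eight values quoted above and compares them with the error from rounding a single value to BF16. The to_bf16 helper is a hypothetical illustration written with the standard library only (round to nearest, ties to even, no NaN or infinity handling); it is not how the Neuron compiler or hardware performs the cast.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value, returned as a Python float.

    Sketch only: NaN and infinity inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # round to nearest, ties to even
    bits = (bits + bias) & 0xFFFF0000                    # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def max_rel_diff(reference, other):
    """Largest |ref - other| / |ref| over paired entries."""
    return max(abs(r - o) / abs(r) for r, o in zip(reference, other))


# First eight values of each output, copied from the comment above.
gpu = [-1.0692453384399414, -1.4999507665634155, 1.6326944828033447,
       -0.13731196522712708, -0.8026626110076904, -0.48562130331993103,
       -0.21466472744941711, 0.44606760144233704]
cpu = [-1.0692460536956787, -1.4999487400054932, 1.6326937675476074,
       -0.13731253147125244, -0.8026641607284546, -0.48562222719192505,
       -0.21466375887393951, 0.4460683763027191]
inferentia = [-1.0766984224319458, -1.4989659786224365, 1.6356642246246338,
              -0.13928218185901642, -0.8090097904205322, -0.4883664846420288,
              -0.2172311544418335, 0.4422350823879242]

cpu_vs_gpu = max_rel_diff(gpu, cpu)
inf_vs_gpu = max_rel_diff(gpu, inferentia)
bf16_rounding = max_rel_diff(gpu, [to_bf16(v) for v in gpu])

print(f"max relative diff, CPU vs GPU:        {cpu_vs_gpu:.2e}")
print(f"max relative diff, Inferentia vs GPU: {inf_vs_gpu:.2e}")
print(f"max relative error of one BF16 cast:  {bf16_rounding:.2e}")
```

BF16 keeps 8 significant bits, so a single cast can introduce a relative error of up to about 0.4%. Differences of around 1% after many matrix multiplications are therefore plausible for BF16 arithmetic, while float32 results on CPU and GPU agree to within roughly 1e-5 for these values.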
