Unable to compile model to Neuron: no error message, no output


Hi. We are trying to convert all our in-house PyTorch models to AWS Neuron on Inferentia. We successfully converted one, but the second model we tried did not compile. Unfortunately, compilation did not produce any error message or log of any kind, so we are stuck. The model is a rather simple but large U-Net, with partial convolutions instead of regular ones and otherwise no fancy operators. Converting this model to TorchScript works on the same instance. Could it be a memory problem?

2 Answers
Accepted Answer

Hi, in order to see more information about the error, you can enable debugging during tracing by passing 'verbose' to the tracing command like this:

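import torch
import torch.neuron

# model: your torch.nn.Module; inp: an example input tensor (or tuple of tensors)
torch.neuron.trace(
    model,
    example_inputs=inp,
    verbose="debug",          # print debug-level compiler output to the console
    compiler_workdir="logs"   # dir where debugging logs will be saved
)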

You'll see the error messages in the console and they will also be saved to the "logs" dir.

It is always good to run the Neuron SDK analyzer first to make sure the model is (1) torch.jit traceable and (2) supported by the compiler:

import torch
import torch.neuron
torch.neuron.analyze_model(model, example_inputs=inp)

You can also look at a sample that shows how to compile a third-party PyTorch U-Net implementation for Inf1 instances here: https://github.com/samir-souza/laboratory/blob/master/05_Inferentia/03_UnetPytorch/03_UnetPytorch.ipynb

Ref: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-compilation-python-api.html

If all else fails, look for something like this in the logs:

INFO:Neuron:Compile command returned: -11
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$647; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:

Then please paste it here. The code after "Compile command returned:" makes it possible to identify the error. You suspect a memory-related issue, possibly out of memory. When that is the case, you will normally find the code -9 in this part of the error.
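To make that return code easier to read, here is a minimal sketch that pulls it out of the log text and decodes it. It assumes the logged value follows the usual subprocess convention, where a negative value -N means the process was terminated by signal N. The helper names `find_return_codes` and `describe_return_code` are hypothetical and not part of the Neuron SDK.

```python
import re
import signal

# Matches the log line quoted above, e.g. "INFO:Neuron:Compile command returned: -9"
RETURN_CODE_RE = re.compile(r"Compile command returned:\s*(-?\d+)")


def find_return_codes(log_text):
    """Return every compiler return code found in the log text, in order."""
    return [int(m.group(1)) for m in RETURN_CODE_RE.finditer(log_text)]


def describe_return_code(code):
    """Describe a subprocess-style return code.

    Assumption: a negative value -N means "terminated by signal N",
    as with Python's subprocess module on POSIX systems.
    """
    if code == 0:
        return "success"
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


# Sample text in the same format as the log excerpt above.
sample_log = """INFO:Neuron:Compile command returned: -9
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$647; falling back to native python function call"""

for code in find_return_codes(sample_log):
    print(code, "->", describe_return_code(code))
```

Under that assumption, -9 corresponds to SIGKILL, which is the signal the Linux out-of-memory killer sends, and -11 corresponds to SIGSEGV, which would point to a crash in the compiler rather than a memory limit.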

AWS
answered 2 years ago

Following your answer, we were able to check the log and got:

INFO:Neuron:Compile command returned: -9

which apparently indicates an out-of-memory error. Switching to a 6x instance solved the problem.
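For anyone hitting the same thing, a quick way to see how much memory the instance has free before starting a compilation is to read /proc/meminfo. This is only a sketch and assumes a Linux host. The function names `parse_meminfo` and `available_gib` are hypothetical, and no particular threshold is implied, since how much memory the compiler needs depends on the model.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(meminfo):
    """Return MemAvailable in GiB, or None if the field is missing."""
    kib = meminfo.get("MemAvailable")
    return None if kib is None else kib / (1024 * 1024)


MEMINFO_PATH = "/proc/meminfo"  # Linux only; skipped on other systems
if os.path.exists(MEMINFO_PATH):
    with open(MEMINFO_PATH) as f:
        info = parse_meminfo(f.read())
    gib = available_gib(info)
    if gib is not None:
        print(f"Memory available before compiling: {gib:.1f} GiB")
```

Comparing this number across instance sizes gives a rough idea of the headroom gained by moving to a larger instance.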

answered 2 years ago
