Unable to compile model to Neuron: no error message, no output


Hi. We are trying to convert all our in-house PyTorch models to AWS Neuron on Inferentia. We successfully converted one, but the second model we tried did not compile. Unfortunately, compilation produced no error message and no log of any kind, so we are stuck. The model is a fairly simple but large U-Net that uses partial convolutions instead of regular ones, with no other fancy operators. Converting this model to TorchScript works fine on the same instance. Could it be a memory problem?

Asked 2 years ago · 325 views
2 Answers

Accepted Answer

Hi. To get more information about the failure, enable debug output during tracing by passing the verbose argument to torch.neuron.trace, like this:

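import torch
import torch.neuron

# model: your torch.nn.Module; inp: an example input tensor (or tuple of tensors)
model_neuron = torch.neuron.trace(
    model,
    example_inputs=inp,
    verbose="debug",          # print detailed compiler messages
    compiler_workdir="logs",  # dir where debugging logs will be saved
)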

You'll see the error messages in the console and they will also be saved to the "logs" dir.
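Once the "logs" directory is populated, it can be tedious to open every file by hand. Here is a minimal sketch that walks the directory and collects lines containing a few marker strings. The helper name find_log_lines, the default markers, and the assumption that the files are plain text are all mine, not part of the Neuron SDK, so adjust them to whatever your workdir actually contains.

```python
import os

# Hypothetical helper, not part of torch.neuron: scans every file under
# `workdir` and returns (path, line) pairs for lines containing any marker.
def find_log_lines(workdir, markers=("Compile command returned", "ERROR:")):
    hits = []
    for root, _dirs, files in os.walk(workdir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going past binary artifacts
                with open(path, "r", errors="replace") as fh:
                    for line in fh:
                        if any(marker in line for marker in markers):
                            hits.append((path, line.rstrip("\n")))
            except OSError:
                # Unreadable file: skip it rather than abort the scan
                continue
    return hits


if __name__ == "__main__":
    for path, line in find_log_lines("logs"):
        print(f"{path}: {line}")
```

If the scan returns nothing, the messages may only have gone to the console, in which case redirecting the console output to a file and searching that is the simpler route.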

It is also a good idea to run the Neuron SDK analyzer first, to make sure the model is (1) traceable with torch.jit and (2) supported by the compiler:

import torch
import torch.neuron
torch.neuron.analyze_model(model, example_inputs=inp)

You can also look at a sample that shows how to compile a U-Net PyTorch model (third-party implementation) for Inf1 instances here: https://github.com/samir-souza/laboratory/blob/master/05_Inferentia/03_UnetPytorch/03_UnetPytorch.ipynb

Ref: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-compilation-python-api.html

If all else fails, look for something like this in the logs:

INFO:Neuron:Compile command returned: -11
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$647; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:

Please paste that output here. The code after "Compile command returned:" makes it possible to identify the error. You suspect a memory-related issue, possibly an out-of-memory condition; when that is the case, you will normally find the code -9 in this part of the log.
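To make the return code easier to read: on Linux, a negative return code from a child process conventionally means the process was terminated by the signal with that number, so -9 corresponds to SIGKILL (the signal the kernel's out-of-memory killer sends) and -11 to SIGSEGV. Below is a small sketch that pulls the code out of a log line and names the signal. The function name explain_return_code is hypothetical, and it assumes the log line has the same shape as the sample above.

```python
import re
import signal

# Matches the "Compile command returned: <code>" fragment shown in the log sample
RETURN_RE = re.compile(r"Compile command returned:\s*(-?\d+)")

# Hypothetical helper, not part of torch.neuron
def explain_return_code(log_line):
    """Return a short description of the compiler return code, or None if absent."""
    match = RETURN_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    if code == 0:
        return "success"
    if code < 0:
        # Negative code: process was terminated by signal number -code (POSIX convention)
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(explain_return_code("INFO:Neuron:Compile command returned: -9"))
    print(explain_return_code("INFO:Neuron:Compile command returned: -11"))
```

A SIGKILL is not proof of memory exhaustion on its own; checking the kernel log (for example with dmesg) for an out-of-memory message is a reasonable way to confirm it.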

AWS
Answered 2 years ago

Following your answer, we were able to check the log and got:

INFO:Neuron:Compile command returned: -9

which is apparently an out-of-memory error. Switching to a 6x instance solved the problem.

Answered 2 years ago
