Unable to compile model to Neuron: no error message, no output


Hi. We are trying to convert all our in-house PyTorch models to AWS Neuron on Inferentia. We successfully converted one, but the second model we tried did not compile. Unfortunately, compilation produced no error message and no log of any kind, so we are stuck. The model is a rather simple but large U-Net that uses partial convolutions instead of regular ones, with no other unusual operators. Converting this model to TorchScript works on the same instance. Could it be a memory problem?

Asked 2 years ago, 325 views
2 Answers
Accepted Answer

Hi. To see more information about the error, you can enable debug logging during tracing by passing the 'verbose' argument to the trace call, like this:

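import torch
import torch.neuron

# `model` is the PyTorch module to compile; `inp` is an example input for it
torch.neuron.trace(
    model,
    example_inputs=inp,
    verbose="debug",          # print debug-level messages while compiling
    compiler_workdir="logs",  # dir where debugging logs will be saved
)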

You'll see the error messages in the console and they will also be saved to the "logs" dir.

It is always a good idea to run the Neuron SDK analyzer first to check that the model is (1) torch.jit traceable and (2) supported by the compiler:

import torch
import torch.neuron
torch.neuron.analyze_model(model, example_inputs=inp)

You can also find a sample that shows how to compile a PyTorch U-Net (third-party implementation) for Inf1 instances here: https://github.com/samir-souza/laboratory/blob/master/05_Inferentia/03_UnetPytorch/03_UnetPytorch.ipynb

Ref: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-compilation-python-api.html

If all else fails, look for something like this in the logs:

INFO:Neuron:Compile command returned: -11
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$647; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:

Then please paste it here. The code after "Compile command returned:" makes it possible to identify the error. You suspect a memory-related issue, possibly out of memory. When that is the case, you will normally find the code -9 in this part of the output.
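To pull that code out of a long log quickly, something like the following may help. It is only a minimal sketch: the helper names `find_return_codes` and `describe_return_code` are made up for illustration, and it assumes the reported number follows the usual Python subprocess convention, where a negative value -N means the process was terminated by signal N (so -9 would be SIGKILL, which is what the Linux out-of-memory killer sends, and -11 would be SIGSEGV).

```python
import re
import signal

# Matches lines such as "INFO:Neuron:Compile command returned: -9".
RETURN_CODE_RE = re.compile(r"Compile command returned:\s*(-?\d+)")


def find_return_codes(log_text):
    """Return every compiler return code found in the log text, in order."""
    return [int(m.group(1)) for m in RETURN_CODE_RE.finditer(log_text)]


def describe_return_code(code):
    """Describe a return code, assuming the subprocess convention
    that a negative value -N means "terminated by signal N"."""
    if code == 0:
        return "success"
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-code}"
        return f"terminated by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    sample = "INFO:Neuron:Compile command returned: -9\n"
    for code in find_return_codes(sample):
        print(code, describe_return_code(code))
```

On Linux, the sample line above should be reported as terminated by SIGKILL. To scan a real log, read the file's contents from the "logs" dir and pass the text to `find_return_codes`.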

AWS
Answered 2 years ago

Following your answer, we were able to check the log and got:

INFO:Neuron:Compile command returned: -9

which is apparently an out-of-memory error. Switching to a 6x instance solved the problem.
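For anyone sizing an instance for a similar model, a rough way to see how much memory a child process needed is to read the peak resident set size after it finishes. This is a generic sketch, not part of the Neuron SDK: `peak_child_rss_mib` and `run_and_report` are hypothetical helper names, the units assume Linux (where `ru_maxrss` is reported in KiB), and it only covers child processes that the current Python process has waited for. Whether the compiler run is captured this way in a given setup is an assumption to verify.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib():
    """Peak resident set size, in MiB, of the largest child process
    waited for so far. Assumes Linux, where ru_maxrss is in KiB."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss / 1024.0


def run_and_report(cmd):
    """Run a command and return (return code, peak child RSS in MiB)."""
    completed = subprocess.run(cmd)
    return completed.returncode, peak_child_rss_mib()


if __name__ == "__main__":
    # Stand-in workload: a child Python process that allocates about 50 MiB.
    code, peak = run_and_report(
        [sys.executable, "-c", "x = bytearray(50 * 1024 * 1024)"]
    )
    print(f"return code: {code}, peak child RSS: {peak:.1f} MiB")
```

Comparing that figure with the memory of the instance gives a first hint about whether a -9 is plausible. A process killed by the out-of-memory killer never reaches its true peak, so the number is a lower bound in that case.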

Answered 2 years ago
