
What is a practical Inferentia limit to model size?


I am trying to test a model compiled for Inferentia on an inf1.2xlarge, but when loading the model I receive the following error messages:

2022-Sep-15 22:10:01.0152  3802:3802  ERROR  TDRV:dmem_alloc                              Failed to alloc DEVICE memory: 1073741824
2022-Sep-15 22:10:01.0152  3802:3802  ERROR  TDRV:dma_ring_alloc                          Failed to allocate TX ring
2022-Sep-15 22:10:01.0172  3802:3802  ERROR  TDRV:io_create_rings                         Failed to allocate io ring for queue qPoolOut0_0
2022-Sep-15 22:10:01.0172  3802:3802  ERROR  TDRV:kbl_model_add                           create_io_rings() error
2022-Sep-15 22:10:01.0182  3802:3802  ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2022-Sep-15 22:10:01.0182  3802:3802  ERROR  NMGR:stage_kelf_models                       Failed to stage graph: kelf-a.json to NeuronCore
2022-Sep-15 22:10:01.0184  3802:3802  ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN:, err: 4

These are wrapped into a Python runtime exception:

RuntimeError: Could not load the model status=4 message=Allocation Failure

I presume that this is because the model is on the large side. The .neff file is 373 MB and takes ~4 hours to compile for a batch size of 1.
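For scale, the allocation that fails in the `dmem_alloc` line of the log above is exactly 1 GiB, noticeably more than the compiled NEFF itself; reading the log, the runtime appears to be allocating DMA ring buffers on top of the compiled artifact (my interpretation, not official guidance):

```python
# Size taken verbatim from the dmem_alloc error line (bytes).
failed_alloc_bytes = 1073741824
# Approximate size of the 373 MB NEFF file mentioned below.
neff_bytes = 373 * 1024 * 1024

print(failed_alloc_bytes / 2**30)       # 1.0 -> the failed request is exactly 1 GiB
print(failed_alloc_bytes > neff_bytes)  # True -> the runtime asks for more than the NEFF alone
```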

This particular model is compiled for a single NeuronCore. I am now trying to compile with --neuroncore-pipeline-cores 4 to spread the model across multiple cores. This, however, gives me the following log message:
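A minimal sketch of how that flag is passed through the PyTorch Neuron tracer (the `pipeline_args` helper is illustrative only; the `torch.neuron.trace` call comes from the torch-neuron package, needs the Neuron SDK installed, and is therefore shown as a comment rather than run):

```python
def pipeline_args(num_cores):
    """Build the neuron-cc argument list that asks the compiler to pipeline
    the model across `num_cores` NeuronCores (an inf1.2xlarge has 4)."""
    return ["--neuroncore-pipeline-cores", str(num_cores)]

# The actual compilation requires the Neuron SDK and an example input,
# so it is sketched here rather than executed:
#
#   import torch
#   import torch_neuron  # registers torch.neuron
#
#   traced = torch.neuron.trace(model, example_input,
#                               compiler_args=pipeline_args(4))
#   traced.save("model_neuron.pt")

print(pipeline_args(4))  # ['--neuroncore-pipeline-cores', '4']
```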

INFO: The requested number of neuroncore-pipeline-cores (4) may not be suitable for this network, and may lead to sub-optimal performance. Recommended neuroncore-pipeline-cores for this network is 1.

(I can't find any technical details on how much memory an Inferentia chip has, although I'm guessing that due to Inferentia architecture "memory" is not used in the same way as it might be on CPU or GPU.)
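For what it's worth, a commonly quoted figure (treat it as an assumption, not a spec I can vouch for here) is 8 GB of device DRAM per Inferentia chip, shared by its 4 NeuronCores, plus a much smaller on-chip cache. A back-of-envelope per-core budget under that assumption:

```python
# Assumption: 8 GiB of device DRAM per Inferentia chip, shared by 4 NeuronCores.
chip_dram_gib = 8
neuroncores_per_chip = 4

per_core_gib = chip_dram_gib / neuroncores_per_chip
print(per_core_gib)  # 2.0 -> roughly 2 GiB per core if split evenly
```

On that rough budget, a single failed 1 GiB allocation would already consume half of one core's even share, which may explain why a 373 MB NEFF still fails to load.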

So, what is a practical size limit for an Inferentia model and what can I do about running this model on Inf1?

1 Answer

Hi @ntw-au. We are working on some more general guidance (which you have requested here), but it looks like your model should load on one core once compiled. Are you able to share more information for us to diagnose?

answered 8 days ago
  • Thank you, I'm not able to share details publicly (happy to talk privately), but input is dimension 1×32768×3 and the model is multiple layers of convolution and batch normalisation, running on PyTorch. The trained model is 15MB.

  • Also, the native data type is generally FP32, with integers for indices

  • If you could please email us at we can start a more in-depth conversation. In short, we won't need the exact model or weights, just something close enough to replicate the failure. Because of the operator fusing that happens during compilation, this will likely need to be relatively close in structure.

  • Hi @ntw-au, are you able to email us ( so that we can start a private conversation about the model?

  • Thanks @mrnikwaws and @AWS-mvaria, I've sent an email to your support address
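One more back-of-envelope check on the sizes quoted in this thread: a 15 MB checkpoint of FP32 weights corresponds to only a few million parameters, which suggests (my inference, not confirmed above) that the 373 MB NEFF is dominated by compiled instructions and scheduling data rather than by the weights themselves:

```python
# Assumption: the 15 MB checkpoint is essentially all FP32 weights, 4 bytes each.
checkpoint_bytes = 15 * 1024 * 1024
approx_params = checkpoint_bytes // 4

print(approx_params)  # 3932160 -> roughly 3.9 million parameters
```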
