Neuron model loads when compiled for 1 core but fails to load when compiled for 4 cores


Hello, we are testing pipeline mode for Neuron/Inferentia but cannot get a model running multi-core. The single-core compiled model loads fine and runs inference on Inferentia without issue. However, after compiling the model for multi-core with compiler_args=['--neuroncore-pipeline-cores', '4'] (which takes ~16 hours on an r6a.16xlarge), the model errors out while loading into memory on the Inferentia box. Here's the error message:

2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:dmem_alloc                              Failed to alloc DEVICE memory: 589824
2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:copy_and_stage_mr_one_channel           Failed to allocate aligned (0) buffer in MLA DRAM for W10-t of size 589824 bytes, channel 0
2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:kbl_model_add                           copy_and_stage_mr() error
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:dmem_alloc                              Failed to alloc DEVICE memory: 16777216
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:dma_ring_alloc                          Failed to allocate RX ring
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:drs_create_data_refill_rings            Failed to allocate pring for data refill dma
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:kbl_model_add                           create_data_refill_rings() error
2022-Nov-22 22:29:26.0116 20764:20764 ERROR  TDRV:remove_model                            Unknown model: 1001
2022-Nov-22 22:29:26.0116 20764:20764 ERROR  TDRV:kbl_model_remove                        Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  TDRV:remove_model                            Unknown model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  TDRV:kbl_model_remove                        Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2022-Nov-22 22:29:26.0354 20764:20764 ERROR  NMGR:stage_kelf_models                       Failed to stage graph: kelf-a.json to NeuronCore
2022-Nov-22 22:29:26.0364 20764:20764 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmpab7oth00, err: 4
Traceback (most recent call last):
  File "infer_test.py", line 34, in <module>
    model_neuron = torch.jit.load('model-4c.pt')
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
    script_module = jit_load(*args, **kwargs)
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Could not load the model status=4 message=Allocation Failure
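
For reference, here is roughly how we compile and load the model. This is only a sketch: the ResNet-50 stand-in, the 224x224 input, and the file name are placeholders for our actual CV model and resolution.

import torch
import torch_neuron
import torchvision.models as models

# Placeholder model and input shape -- our real model is a custom CV network
# compiled at a much higher input resolution.
model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Compile for a 4-core NeuronCore pipeline; this is the step that takes ~16 hours for us.
model_neuron = torch_neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model-4c.pt')

# On the Inf1 instance, the load below is where the allocation failure is raised.
model_neuron = torch.jit.load('model-4c.pt')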

Any help would be appreciated.

asked a year ago · 385 views
1 answer

Hi - can you confirm the type of Inf1 instance you are using for this, and whether you are using any container configuration? Also, how many cores have you assigned to the process? I want to make sure the process has been assigned sufficient cores (as shown here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#nrt-configuration ).
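
For example, here is a minimal sketch of how those variables could be set before loading the model. The values are illustrative; per the linked page, NEURON_RT_VISIBLE_CORES takes precedence over NEURON_RT_NUM_CORES, so you would normally set only one of them.

import os

# Must be set before the Neuron runtime is initialized in this process.
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'   # give this process NeuronCores 0..3
# os.environ['NEURON_RT_NUM_CORES'] = '4'       # alternative: request any 4 cores

import torch
import torch_neuron  # registers the Neuron backend

# A 4-core pipeline model needs 4 NeuronCores visible to the process to load.
model = torch.jit.load('model-4c.pt')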

AWS
answered a year ago
  • This is being run on an inf1.2xl box currently, just in a one-off dev config with the DL AMI. I have not changed the env vars; based on the documentation, it seems that by default all cores are assigned to the process. For a 4-core inf1.2xl box, what values would make sense for visible_cores and num_cores with a 100% utilization target?

  • After testing with different values, changing visible_cores to 0-3 and num_cores to 4 (and everything in between) did not make any difference - the same error still occurs.

  • Would you be able to share more details on the model you are attempting to compile? It is very unusual to see a 16-hour compilation time, which may indicate that there are other issues occurring here even before the model is executed.

    Could you potentially share which model is being used or a proxy model that has similar behavior?

    If this is a fully custom/private model, it could be helpful for us to look at a version of the model with the weights set to zero just to see if there are improvements we could make to the compilation process. If you can email steps/files/instructions for reproduction directly to aws-neuron-support@amazon.com then we can take a look.

  • After some trial and error we came to the conclusion that Neuron does not handle compiling/running CV models at higher input resolutions well. We ended up tiling our inputs, which appears to be working much better; a rough sketch of the approach is below.
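
    Roughly what the tiling looks like (illustrative only; our real pipeline also pads edge tiles to a fixed shape and stitches the per-tile outputs back together):

    import torch

    def tile_input(x, tile=512):
        # Split an (N, C, H, W) tensor into non-overlapping tile x tile patches.
        n, c, h, w = x.shape
        tiles = []
        for top in range(0, h, tile):
            for left in range(0, w, tile):
                tiles.append(x[:, :, top:top + tile, left:left + tile])
        return tiles

    # model_neuron is the compiled model loaded with torch.jit.load
    # outputs = [model_neuron(t) for t in tile_input(high_res_batch)]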
