Some questions about compiling a model for Inferentia


These questions arose while I was reading the docs; I couldn't find the answers there.

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up?

  2. Does compiling on an Inf1 instance utilize NeuronCores, or just the CPU, as on other instances?

  3. There seem to be several approaches to using more than one NeuronCore for inference. I found:

    • torch.neuron.DataParallel
    • setting os.environ['NEURON_RT_NUM_CORES'] before compiling
    • passing --neuroncore-pipeline-cores to the trace function

    Is there any difference?

Asked a year ago · 538 views
1 Answer
Accepted Answer

I assume you are referring to compiling a model for an Inf1-based instance. If so, here are some answers to your questions:

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up? [A] Yes, compilation happens on the CPU and cannot be sped up with a GPU.
  2. Does compiling on an Inf1 instance utilize NeuronCores, or just the CPU, as on other instances? [A] Compiling uses CPU resources, and you can use a separate instance (e.g., C5, C6, or z1d instance types) for the fastest compile times.
  3. There seem to be several approaches to using more than one NeuronCore for inference. [A] All three options use multiple NeuronCores, but each is suited to a different scenario (minimal sketches of each follow this list).
  • torch.neuron.DataParallel implements data parallelism at the module level by replicating a single Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference. It includes features such as dynamic batching, which allows you to use tensor batch sizes that the Neuron model was not originally compiled against. This is necessary because the underlying Inferentia hardware will always execute inferences with the batch size used during compilation. More details can be found in the Neuron documentation.
  • NEURON_RT_NUM_CORES is a runtime environment variable that tells the Neuron Runtime to automatically reserve the specified number of free NeuronCores for a particular process. It is useful to set if you have multiple processes attempting to access NeuronCores on your instance, so you can control how many each process can access. More details can be found in the Neuron documentation.
  • neuroncore-pipeline-cores is a compile-time setting that shards your model across multiple NeuronCores, caching the model parameters in each core's on-chip memory and then streaming inference requests across the cores in a pipelined manner. More details can be found in the Neuron documentation.
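
To make the distinction concrete, here is a minimal sketch of the torch.neuron.DataParallel approach. It assumes a model that was already compiled with torch.neuron.trace and saved to disk; the file name, input shape, and batch size are illustrative:

```python
import torch
import torch_neuron  # registers the torch.neuron namespace

# Load a model previously compiled with torch.neuron.trace
# ('model_neuron.pt' is a placeholder name).
model_neuron = torch.jit.load('model_neuron.pt')

# Replicate the model on all available NeuronCores; inputs are split
# along the batch dimension (dim=0 by default) and the per-core
# results are concatenated back together.
model_parallel = torch.neuron.DataParallel(model_neuron)

# With dynamic batching, this batch size does not have to match the
# batch size the model was originally compiled with.
batch = torch.rand(8, 3, 224, 224)
output = model_parallel(batch)
```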
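
NEURON_RT_NUM_CORES, by contrast, is read by the Neuron Runtime when it initializes, so it has to be set before any Neuron model is loaded in the process. A sketch, with the core count of 2 chosen purely for illustration:

```python
import os

# Must be set before the Neuron Runtime initializes, i.e. before any
# Neuron model is loaded in this process.
os.environ['NEURON_RT_NUM_CORES'] = '2'  # reserve 2 free NeuronCores

import torch
import torch_neuron

# This process now gets at most the 2 reserved NeuronCores; other
# processes on the instance can reserve the remaining cores the same way.
model_neuron = torch.jit.load('model_neuron.pt')  # placeholder file name
```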
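
Finally, neuroncore-pipeline-cores is passed to the compiler when the model is traced. A sketch, assuming a torchvision ResNet-50 as a stand-in model; the choice of model and the core count of 4 are illustrative:

```python
import torch
import torch_neuron
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Shard the model across 4 NeuronCores at compile time; at inference
# time, requests stream through the cores in a pipelined fashion.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model_neuron_pipelined.pt')
```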
AWS
EXPERT
Answered a year ago
