Some questions about compiling a model for Inferentia


These questions came up while I was reading the docs and couldn't find the answers there.

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up?

  2. Does compiling on an Inf1 instance utilize NeuronCores, or the CPU, just like on other instances?

  3. There seem to be several approaches to using more than one NeuronCore for inference. I found

    • torch.neuron.DataParallel
    • setting os.environ['NEURON_RT_NUM_CORES'] before compiling
    • passing --neuroncore-pipeline-cores to the trace function

    Is there any difference?

posted a year ago · 538 views

1 Answer

Accepted Answer

I assume you are referring to compiling a model for an Inf1-based instance. If so, here are some answers to your questions:

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up? [A] Yes, compilation happens on the CPU and cannot be sped up with a GPU.
  2. Does compiling on an Inf1 instance utilize NeuronCores, or a CPU just like other instances? [A] Compiling uses CPU resources only, and you can use a separate instance (e.g. C5, C6, or z1d instance types) for the fastest compile times.
  3. There seem to be several approaches to using more than one NeuronCore for inference. [A] Each of the three options uses multiple NeuronCores, but they serve different scenarios; minimal sketches of each follow the list below.
  • torch.neuron.DataParallel implements data parallelism at the module level by replicating a single Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference. It includes features such as dynamic batching, which lets you use tensor batch sizes that the Neuron model was not originally compiled against. This matters because the underlying Inferentia hardware always executes inferences with the batch size used during compilation. More details can be found in the Neuron documentation.
  • NEURON_RT_NUM_CORES is a runtime environment variable that tells the Neuron Runtime to automatically reserve the specified number of free NeuronCores for a particular process. It is useful when multiple processes need access to NeuronCores on the same instance, since it controls how many cores each process can claim. More details can be found in the Neuron documentation.
  • neuroncore-pipeline-cores is a compile-time setting that shards your model across multiple NeuronCores, caching the model parameters in each core's on-chip memory, and then streaming inference requests across the cores in a pipelined manner. More details can be found in the Neuron documentation.
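To make the difference concrete, here is a minimal sketch of the torch.neuron.DataParallel approach. It assumes torch-neuron is installed on an Inf1 instance and uses a torchvision ResNet-50 purely as a placeholder model; substitute your own model and input shapes:

```python
import torch
import torch_neuron  # registers the Neuron backend with PyTorch
from torchvision import models

# Placeholder model and example input; substitute your own.
model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Compilation itself runs on the CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Replicate the compiled model on all available NeuronCores and
# split incoming batches across them.
model_parallel = torch.neuron.DataParallel(model_neuron)

# Dynamic batching is enabled by default, so a batch size other than
# the one used at compile time (1 here) still works.
batch = torch.rand(8, 3, 224, 224)
output = model_parallel(batch)
```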
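For NEURON_RT_NUM_CORES, the key point is that it must be set before the Neuron runtime initializes in the process, i.e. before the first compiled model is loaded. A minimal sketch (the saved model file name is illustrative):

```python
import os

# Reserve 2 NeuronCores for this process; other processes on the same
# instance can claim the remaining cores. Must be set before the Neuron
# runtime starts, so do it before loading any compiled model.
os.environ['NEURON_RT_NUM_CORES'] = '2'

import torch
import torch_neuron

# 'model_neuron.pt' is a placeholder for a previously compiled model.
model = torch.jit.load('model_neuron.pt')
```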
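And for neuroncore-pipeline-cores, the flag is passed to the compiler through the trace call. A minimal sketch, again with a placeholder model, sharding across 4 cores:

```python
import torch
import torch_neuron
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Shard the model across 4 NeuronCores at compile time so its weights
# stay cached in each core's on-chip memory; choose a core count that
# matches your Inf1 instance size.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model_neuron_pipelined.pt')  # illustrative file name
```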
AWS
EXPERT
answered a year ago
