Some questions about compiling a model for Inferentia


These questions arose while I was reading the docs and couldn't find the answers there.

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up?

  2. Does compiling on an Inf1 instance utilize NeuronCores, or the CPU, just like on other instances?

  3. There seem to be several approaches to using more than one NeuronCore for inference. I found:

    • torch.neuron.DataParallel
    • setting os.environ['NEURON_RT_NUM_CORES'] before compiling
    • passing --neuroncore-pipeline-cores to the trace function

    Is there any difference?

asked a year ago · 524 views
1 Answer
Accepted Answer

I assume you are referring to compiling a model for an Inf1-based instance. If so, here are some answers to your questions:

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up? [A] Yes, compilation happens on the CPU and cannot be sped up with a GPU.
  2. Does compiling on an Inf1 instance utilize NeuronCores, or the CPU, just like on other instances? [A] Compiling uses CPU resources, and you can use a separate instance (e.g., C5, C6, or z1d instance types) for the fastest compile times.
  3. There seem to be several approaches to using more than one NeuronCore for inference. [A] All three options use multiple NeuronCores, but they address different scenarios.
  • torch.neuron.DataParallel implements data parallelism at the module level by replicating a single Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference. It includes features such as dynamic batching, which lets you use tensor batch sizes that the Neuron model was not originally compiled against. This matters because the underlying Inferentia hardware always executes inferences with the batch size used during compilation. More details can be found in the Neuron documentation; see the first sketch after this list.
  • NEURON_RT_NUM_CORES is a runtime environment variable that tells the Neuron Runtime to automatically reserve the specified number of free NeuronCores for a particular process. It is useful when multiple processes attempt to access NeuronCores on your instance, so you can control how many cores each process can use. More details can be found in the Neuron documentation; see the second sketch after this list.
  • neuroncore-pipeline-cores is a compile-time setting that shards your model across multiple NeuronCores, caches the model parameters in each core's on-chip memory, and then streams inference requests across the cores in a pipelined manner. More details can be found in the Neuron documentation; see the third sketch after this list.
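
First, a minimal sketch of the torch.neuron.DataParallel flow. The toy model, shapes, and batch sizes are placeholders for illustration, not from the original question:

```python
import torch
import torch.neuron  # provided by the torch-neuron package for Inf1

# Placeholder model; any traceable torch.nn.Module works here.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
model.eval()

# Compilation itself runs on the host CPU, even on an Inf1 instance.
example = torch.rand(1, 128)
model_neuron = torch.neuron.trace(model, example_inputs=example)

# Replicate the compiled model across all visible NeuronCores and
# distribute incoming batches across them.
model_parallel = torch.neuron.DataParallel(model_neuron)

# Dynamic batching: the runtime splits this batch of 8 into chunks of
# the compile-time batch size (1) and runs them across the cores.
batch = torch.rand(8, 128)
output = model_parallel(batch)
```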
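Second, a sketch of the NEURON_RT_NUM_CORES pattern. Note that it must be set before the process loads its first Neuron model (it is a runtime setting, not a compile-time one, despite how the question phrased it); the model path is a placeholder:

```python
import os

# Reserve 2 free NeuronCores for this process. This must happen before
# the Neuron runtime initializes, i.e., before the first model load.
os.environ['NEURON_RT_NUM_CORES'] = '2'

import torch
import torch.neuron  # registers the Neuron backend with TorchScript

# Load a model compiled earlier; 'model_neuron.pt' is a placeholder path.
model = torch.jit.load('model_neuron.pt')
output = model(torch.rand(1, 128))
```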
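Third, a sketch of NeuronCore Pipeline, passing the flag through the compiler_args parameter of torch.neuron.trace. Again, the model, core count, and filename are illustrative:

```python
import torch
import torch.neuron

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
model.eval()
example = torch.rand(1, 128)

# Shard the model across 4 NeuronCores at compile time; each core caches
# its shard's parameters on-chip, and inference requests stream through
# the cores in a pipeline.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=example,
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)

# Save for deployment; the filename is a placeholder.
torch.jit.save(model_neuron, 'model_pipelined.pt')
```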
AWS
EXPERT
answered a year ago
