1 Answer
I assume you are referring to compiling a model for an Inf1-based instance. If so, here are some answers to your questions:
- Compiling has to be done on a CPU? Can I use accelerators such as GPUs to speed it up? [A] Yes, compilation happens on the CPU and cannot be sped up with a GPU.
- Does compiling on an Inf1 instance utilize NeuronCores, or a CPU just like other instances? [A] Compiling uses CPU resources, and you can use a separate compute-optimized instance (e.g. C5, C6, or z1d instance types) for the fastest compile times.
- There seem to be several approaches to using more than one NeuronCore for inference. [A] Each of the three options below uses multiple NeuronCores, but each targets a different scenario.
torch.neuron.DataParallel
implements data parallelism at the module level by replicating a single compiled Neuron model onto all available NeuronCores and distributing input data across the cores for parallel inference. It includes features such as dynamic batching, which lets you run inference with tensor batch sizes that the Neuron model was not originally compiled against. This matters because the underlying Inferentia hardware always executes inference at the batch size used during compilation. More details can be found in the Neuron documentation.
NEURON_RT_NUM_CORES
is a runtime environment variable that tells the Neuron Runtime to automatically reserve the specified number of free NeuronCores for a particular process. It is useful when multiple processes on the instance need NeuronCores, since it controls how many cores each process can claim. More details can be found in the Neuron documentation.
neuroncore-pipeline-cores
is a compile-time setting that shards your model across multiple NeuronCores, caching the model parameters in each core's on-chip memory (cache), and then streaming inference requests across the cores in a pipelined manner. More details can be found in the Neuron documentation.
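A minimal sketch of the torch.neuron.DataParallel pattern described above. This assumes the torch-neuron package and an Inf1 instance, so it will not run elsewhere; the model and tensor shapes are purely illustrative:

```python
import torch
import torch.neuron  # provided by the torch-neuron package

# Illustrative model; any traceable torch.nn.Module works.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.rand(1, 128)

# Compile for batch size 1; compilation itself runs on the CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Replicate the compiled model across all available NeuronCores.
model_parallel = torch.neuron.DataParallel(model_neuron)

# Dynamic batching allows a batch size (8 here) other than the one the
# model was compiled with (1); DataParallel splits it across the cores.
batch = torch.rand(8, 128)
output = model_parallel(batch)
```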
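NEURON_RT_NUM_CORES is normally exported in the shell before launching each inference process; setting it from Python also works as long as it happens before the Neuron Runtime initializes (i.e., before any Neuron model is loaded). A minimal, hardware-independent sketch, where the value 4 is an illustrative per-process limit:

```python
import os

# Hypothetical limit: reserve 4 NeuronCores for this process. Must be
# set before any Neuron model is loaded in the process, because the
# Neuron Runtime reads it at initialization.
os.environ["NEURON_RT_NUM_CORES"] = "4"

print(os.environ["NEURON_RT_NUM_CORES"])
```

With two such processes each limited to 4 cores, they can share a 16-core inf1 instance without contending for the same NeuronCores.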
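Pipeline parallelism is requested at compile time through a compiler flag. A hedged sketch, again assuming the torch-neuron package and an illustrative shard count of 4:

```python
import torch
import torch.neuron  # provided by the torch-neuron package

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.rand(1, 128)

# Shard the model across 4 NeuronCores at compile time; each core caches
# its shard's weights on-chip, and requests stream through the cores in
# a pipelined manner.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=["--neuroncore-pipeline-cores", "4"],
)
```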