Some questions about compiling a model for Inferentia


These questions arose while I was reading the docs; I couldn't find the answers there.

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up?

  2. Does compiling on an Inf1 instance utilize NeuronCores, or just the CPU, as on other instances?

  3. There seem to be several approaches to using more than one NeuronCore for inference. I found:

    • torch.neuron.DataParallel
    • setting os.environ['NEURON_RT_NUM_CORES'] before compiling
    • passing --neuroncore-pipeline-cores to the trace function

    Is there any difference?

Asked a year ago · 538 views
1 Answer
Accepted Answer

I assume you are referring to compiling a model for an Inf1-based instance. If so, here are some answers to your questions:

  1. Does compiling have to be done on a CPU? Can I use accelerators such as GPUs to speed it up? [A] Yes, compilation happens on the CPU and cannot be sped up with a GPU.
  2. Does compiling on an Inf1 instance utilize NeuronCores, or just the CPU, as on other instances? [A] Compiling uses CPU resources, and you can use a separate instance (e.g., C5, C6, or z1d instance types) for the fastest compile times.
  3. There seem to be several approaches to using more than one NeuronCore for inference. [A] All three options use multiple NeuronCores, but each is suited to a different scenario (minimal sketches of each follow this list).
  • torch.neuron.DataParallel implements data parallelism at the module level by replicating a single Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference. It includes features such as dynamic batching, which allows you to use tensor batch sizes that the Neuron model was not originally compiled against. This is necessary because the underlying Inferentia hardware will always execute inferences with the batch size used during compilation. More details can be found in the Neuron documentation.
  • NEURON_RT_NUM_CORES is a runtime environment variable that tells the Neuron Runtime to automatically reserve the specified number of free NeuronCores for a particular process. It is useful to set if you have multiple processes attempting to access NeuronCores on your instance, so you can control how many each process can access. More details can be found in the Neuron documentation.
  • neuroncore-pipeline-cores is a compile-time setting that shards your model across multiple NeuronCores, caching the model parameters in each core's on-chip memory and then streaming inference requests across the cores in a pipelined manner. More details can be found in the Neuron documentation.
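
To make the distinction concrete, here is a minimal sketch of the torch.neuron.DataParallel approach. It assumes a model that was already compiled with torch.neuron.trace and saved to disk; the file name, input shape, and batch size are illustrative:

```python
import torch
import torch_neuron  # registers the torch.neuron namespace

# Load a model previously compiled with torch.neuron.trace
# ('model_neuron.pt' is a placeholder name).
model_neuron = torch.jit.load('model_neuron.pt')

# Replicate the model on all available NeuronCores; inputs are split
# along the batch dimension (dim=0 by default) and the per-core
# results are concatenated back together.
model_parallel = torch.neuron.DataParallel(model_neuron)

# With dynamic batching, this batch size does not have to match the
# batch size the model was originally compiled with.
batch = torch.rand(8, 3, 224, 224)
output = model_parallel(batch)
```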
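
NEURON_RT_NUM_CORES, by contrast, is read by the Neuron Runtime when it initializes, so it has to be set before any Neuron model is loaded in the process. A sketch, with the core count of 2 chosen purely for illustration:

```python
import os

# Must be set before the Neuron Runtime initializes, i.e. before any
# Neuron model is loaded in this process.
os.environ['NEURON_RT_NUM_CORES'] = '2'  # reserve 2 free NeuronCores

import torch
import torch_neuron

# This process now gets at most the 2 reserved NeuronCores; other
# processes on the instance can reserve the remaining cores the same way.
model_neuron = torch.jit.load('model_neuron.pt')  # placeholder file name
```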
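
Finally, neuroncore-pipeline-cores is passed to the compiler when the model is traced. A sketch, assuming a torchvision ResNet-50 as a stand-in model; the choice of model and the core count of 4 are illustrative:

```python
import torch
import torch_neuron
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Shard the model across 4 NeuronCores at compile time; at inference
# time, requests stream through the cores in a pipelined fashion.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model_neuron_pipelined.pt')
```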
AWS
EXPERT
Answered a year ago
