Running Inferentia Models via Gunicorn


I have compiled my model to run on Inferentia, and I can load multiple models from a single process, such as one Jupyter notebook.

I am trying to host the models behind a server and am using gunicorn as the interface. When I tell gunicorn to use anything more than 1 worker, the process crashes and I receive an error like the following:

2022-Aug-16 00:51:15.0842 22127:22127 ERROR   NRT:nrt_allocate_neuron_cores               NeuronCore(s) not available - Requested:16 Available:0

Gunicorn works with one parent process, and the number of workers specifies the number of child processes, so in this case there are multiple child processes that would each like to use one core.

I would like to know whether there is any way to have all of the cores utilized by multiple child processes. Any documentation around this, or a potential solution that may work, would be greatly appreciated.

Asked 2 years ago · 459 views
2 Answers

I have found the solution to make this happen.

In the app.py file, and in the file where you call torch.jit.load (assuming it is different from your app.py file), set the following environment variable:

import os
os.environ['NEURON_RT_NUM_CORES'] = '1'  # must be set before the Neuron runtime initializes

This tells each Gunicorn child process to use one Neuron core, so you can run X workers, where X is the number of Neuron cores on the device you are running your code on.
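For concreteness, here is a minimal sketch of what such a file could look like. Flask, the model path model.pt, and the /ping route are illustrative assumptions, not part of the original setup; the essential point is that the environment variable is set before anything initializes the Neuron runtime.

# app.py - minimal sketch; Flask, the model path, and the route
# are illustrative assumptions
import os
os.environ['NEURON_RT_NUM_CORES'] = '1'  # claim one Neuron core per worker

import torch
import torch_neuron  # registers the Neuron backend used by torch.jit.load
from flask import Flask, jsonify

app = Flask(__name__)
model = torch.jit.load('model.pt')  # loads onto this worker's single core

@app.route('/ping')
def ping():
    return jsonify(status='ok')

Launched with, for example, gunicorn app:app --workers 4 --bind 0.0.0.0:8000 on an inf1.xlarge (4 Neuron cores), each of the four workers then claims its own core.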

Answered 2 years ago

Inferentia is compatible with FastAPI. The error suggests the program is asking to allocate more cores than are available. As an example, let's assume the instance is an inf1.6xlarge, which has 16 Neuron Cores. Your gunicorn command should be:

gunicorn main-fastapi-demo:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8001

and in your server code main-fastapi-demo.py, make sure you set the environment variable after your import statements:

import os

NUM_CORES = 4
os.environ['NEURON_RT_NUM_CORES'] = str(NUM_CORES)

Taken together, this means you will invoke four gunicorn workers, and each worker gets four Neuron Cores, so a total of 4 x 4 = 16 Neuron Cores are allocated to your server process.
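Putting the command and the environment variable together, a rough sketch of main-fastapi-demo.py could look like this; the model path and the route are placeholder assumptions, and torch_neuron is assumed to be installed for inf1:

# main-fastapi-demo.py - rough sketch; model path and route are placeholders
import os

NUM_CORES = 4
os.environ['NEURON_RT_NUM_CORES'] = str(NUM_CORES)  # reserve four cores per worker

import torch
import torch_neuron  # registers the Neuron backend used by torch.jit.load
from fastapi import FastAPI

app = FastAPI()
model = torch.jit.load('model.pt')  # placed on the cores reserved above

@app.get('/healthz')
def healthz():
    return {'status': 'ok'}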

You may mix and match these parameters; it doesn't have to be 4 x 4. It could be 8 x 2 or 2 x 8. The best combination is determined by benchmarking.
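For instance, the 8 x 2 split would use the same command with --workers 8 and NUM_CORES = 2 in the server file:

# 8 workers x 2 cores each = 16 Neuron Cores on an inf1.6xlarge
gunicorn main-fastapi-demo:app --workers 8 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8001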

AWS
KCT
Answered 2 years ago
