AWS SageMaker real-time endpoint does not process requests concurrently?


I deployed Stable Diffusion v2.0 on AWS SageMaker and created an endpoint for real-time inference (instance type: ml.g4dn.xlarge). I also used Amazon API Gateway and AWS Lambda in front of it.

My question is about the concurrency of the prediction process when the endpoint is invoked. When I check the CloudWatch logs, I see that requests are processed sequentially (one prediction finishes, then the next one starts).

I expected the requests to be handled concurrently, but they are not.

For a real-time endpoint there is no max-concurrency option, so does InvokeEndpoint always proceed sequentially? Is there no way to have requests handled in parallel, other than increasing the instance count?
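For reference, a minimal sketch of how such a parallel test can be run with boto3; the endpoint name and JSON payload format are placeholders:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

# Placeholder endpoint name; substitute your own.
ENDPOINT_NAME = "stable-diffusion-v2-endpoint"

runtime = boto3.client("sagemaker-runtime")

def invoke(prompt):
    # Each call blocks until the endpoint returns a prediction.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"prompt": prompt}),
    )
    return response["Body"].read()

# Fire four requests at once; if the container runs a single worker,
# they will still be served one after another on the instance.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(invoke, ["a cat", "a dog", "a house", "a tree"]))
```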

1 Answer

The concurrency of a real-time endpoint depends on the number of workers maintained inside your algorithm container. For each worker, a copy of the model weights needs to be loaded. In other words, you first need to configure the container to maintain multiple workers and make sure there is enough CPU and GPU memory to host multiple copies of the model.
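For example, if the container is built on the SageMaker PyTorch (or Hugging Face) inference toolkit, the worker count can be set with the SAGEMAKER_MODEL_SERVER_WORKERS environment variable at deployment time. A minimal sketch under that assumption, with placeholder model artifact, role, and instance type:

```python
from sagemaker.pytorch import PyTorchModel

# Placeholder model artifact and role; substitute your own.
model = PyTorchModel(
    model_data="s3://my-bucket/stable-diffusion/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.13",
    py_version="py39",
    entry_point="inference.py",
    # Ask the model server to spawn two workers; each worker loads
    # its own copy of the weights, so GPU memory must fit both.
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2"},
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # more GPU memory than g4dn.xlarge
)
```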

I think for Stable Diffusion the officially recommended GPU memory is 10 GB, and a g4dn.xlarge only comes with 16 GB, which may not be sufficient for two copies of the model running concurrently.

Could you please check the runtime GPU utilization as well as the configured number of workers in your container?
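For instance, SageMaker publishes per-instance GPU metrics to CloudWatch under the /aws/sagemaker/Endpoints namespace; a minimal sketch that pulls GPUMemoryUtilization with boto3 (endpoint and variant names are placeholders):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Average GPU memory utilization over the last hour, in 5-minute bins.
stats = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUMemoryUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "stable-diffusion-v2-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```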

AWS
answered a year ago
  • Thanks, I'll check whether requests can be processed concurrently by using another instance type with more GPU memory.
