AWS SageMaker real-time endpoint does not process requests concurrently?


I deployed Stable Diffusion v2.0 on AWS SageMaker and created an endpoint for real-time inference (instance type is ml.g4dn.xlarge). I also used AWS API Gateway and AWS Lambda.

My question is about the concurrency of the prediction process for invoked requests. When I check the CloudWatch logs, I see that requests are processed sequentially (one prediction finishes, then the next prediction proceeds).

I expected the requests to be handled concurrently, but they are not.

A real-time endpoint has no max-concurrency option, so does InvokeEndpoint always proceed sequentially? Is there no way to have requests handled in parallel, other than increasing the instance count? To verify the behavior from the client side, I ran a test like the sketch below.
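A minimal sketch of such a client-side test, sending several InvokeEndpoint calls at once (the endpoint name and payload are placeholders for your own setup):

```python
# Minimal concurrency test: fire several InvokeEndpoint calls at once.
# "sd-endpoint" and the JSON payload are placeholders.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke(prompt):
    start = time.time()
    runtime.invoke_endpoint(
        EndpointName="sd-endpoint",          # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"prompt": prompt}),
    )
    return time.time() - start

# With a single worker in the container, total wall time is roughly the
# sum of the individual latencies; with multiple workers it should be
# closer to the latency of the slowest single request.
with ThreadPoolExecutor(max_workers=4) as pool:
    latencies = list(pool.map(invoke, ["a cat", "a dog", "a boat", "a tree"]))
print(latencies)
```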

1 Answer

The concurrency of a real-time endpoint depends on the number of workers maintained inside your algorithm container. Each worker loads its own copy of the model weights. In other words, you first need to configure the container to maintain multiple workers and make sure there is enough CPU and GPU memory to host multiple copies of the model.
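As a rough sketch, if you deploy with the SageMaker Python SDK and a container based on the SageMaker inference toolkit, the worker count can typically be set via the SAGEMAKER_MODEL_SERVER_WORKERS environment variable. The S3 path, IAM role, entry point, and framework versions below are placeholders:

```python
# Sketch: deploying with multiple model-server workers. Assumes a
# container based on the SageMaker inference toolkit, which reads
# SAGEMAKER_MODEL_SERVER_WORKERS. All names/paths are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/stable-diffusion/model.tar.gz",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",        # placeholder
    framework_version="1.12",
    py_version="py38",
    entry_point="inference.py",                                 # placeholder
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2"},  # two workers = two model copies
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```

Note that doubling the workers roughly doubles the GPU memory needed, since each worker holds its own copy of the weights.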

I think for Stable Diffusion the officially recommended GPU memory is 10 GB, and the g4dn.xlarge only comes with 16 GB, which is likely not sufficient for two models running concurrently?

Could you please check the runtime GPU utilization as well as the configured number of workers in your container?
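One way to check the runtime GPU figures is through CloudWatch, since SageMaker endpoints publish GPUUtilization and GPUMemoryUtilization under the /aws/sagemaker/Endpoints namespace. A sketch (the endpoint and variant names are placeholders):

```python
# Sketch: pulling GPU memory utilization for an endpoint from CloudWatch.
# "sd-endpoint" and "AllTraffic" are placeholders for your own setup.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUMemoryUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "sd-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},    # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```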

AWS
answered a year ago
  • Thanks, I'll check whether it's possible to process requests concurrently by using another instance type with more GPU memory.
