Can I host multiple models using Nvidia Triton Business Logic Scripting behind a Multi-Model Endpoint?

I'd like to host models that need BLS on a GPU-backed real-time Multi-Model Endpoint (MME), and to scale to hundreds or thousands of such models behind one endpoint. Will this work out of the box? I know MME supports this for models that don't use BLS, but the documentation is not clear on whether an entire BLS pipeline can be treated as an individual model by the auto-scaler.

A more detailed description of my use case:

The model consists of:

  • Model slice A: weights selected from a set of ~5 options, called once per invocation; options are never added or removed
  • Model slice B: weights selected from a set of hundreds of options, called in a loop n times per invocation; new variants are constantly added

Taken on its own, this model seems like a good candidate for Triton BLS, where each slice is its own Triton model and the slice B instance is called n times in a loop from the BLS pipeline script, roughly as in the sketch below.
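Here is a minimal sketch of the BLS model.py I have in mind (Triton Python backend). All model names, tensor names, and the variant selection (slice_a, slice_b_0, INPUT, A_OUTPUT, B_INPUT, B_OUTPUT, n_steps) are placeholders for illustration:

```python
# model.py — minimal Triton BLS sketch (Python backend).
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Run model slice A once per invocation.
            a_request = pb_utils.InferenceRequest(
                model_name="slice_a",  # placeholder model name
                requested_output_names=["A_OUTPUT"],
                inputs=[pb_utils.get_input_tensor_by_name(request, "INPUT")],
            )
            a_response = a_request.exec()
            if a_response.has_error():
                raise pb_utils.TritonModelException(a_response.error().message())
            current = pb_utils.get_output_tensor_by_name(a_response, "A_OUTPUT")

            # Run the chosen slice B variant n times in a loop.
            variant = "slice_b_0"  # placeholder; selected per request in practice
            n_steps = 4            # placeholder; would come from the request
            for _ in range(n_steps):
                b_request = pb_utils.InferenceRequest(
                    model_name=variant,
                    requested_output_names=["B_OUTPUT"],
                    inputs=[pb_utils.Tensor("B_INPUT", current.as_numpy())],
                )
                b_response = b_request.exec()
                if b_response.has_error():
                    raise pb_utils.TritonModelException(b_response.error().message())
                current = pb_utils.get_output_tensor_by_name(b_response, "B_OUTPUT")

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT", current.as_numpy())]
                )
            )
        return responses
```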

My first thought was to add all the model variants to a single BLS model repository, but I am not sure whether this would work with auto-scaling and with new models being added frequently.

The other possibility is to split it into m BLS pipelines, where m is the number of variants I have for model slice B. This should work with auto-scaling, since the target model would match the request's destination model (see the invocation sketch below), but I am not sure whether this is supported, i.e. an entire BLS model hierarchy being loaded and unloaded by SageMaker.
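On the client side, the per-variant routing I have in mind would look roughly like this; the endpoint name and artifact names are made up, and each pipeline-variant-*.tar.gz would package the BLS model plus slice A and its slice B variant:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Standard Triton (KServe v2) JSON inference payload; tensor name/shape are illustrative.
payload = json.dumps({
    "inputs": [
        {"name": "INPUT", "shape": [1, 4], "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}
    ]
})

response = runtime.invoke_endpoint(
    EndpointName="triton-bls-mme-endpoint",     # hypothetical MME endpoint
    TargetModel="pipeline-variant-042.tar.gz",  # which packaged BLS pipeline MME should load
    ContentType="application/json",
    Body=payload,
)
print(json.loads(response["Body"].read()))
```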

I guess the next-best option is to not use MME, and maybe switch to MCE (multi-container endpoints)? But this seems like a big loss in performance.

What is the best way to deploy this model using SageMaker tools?

Asked a year ago · 102 views
No answers
