Can I host multiple models using NVIDIA Triton Business Logic Scripting (BLS) behind a SageMaker Multi-Model Endpoint (MME)?


I'd like to host models that need BLS on a GPU-backed real-time inference endpoint using MME, and to scale to hundreds or thousands of such models behind one endpoint. Will this work out of the box? I know this is supported for models that don't use BLS, but the documentation is not clear on whether an entire BLS pipeline can be treated as an individual model by the auto-scaler.

Better description of my use case:

The model consists of:

  • Model slice A: weights selected from a set of ~5 options; called once per invocation; the set of options never changes
  • Model slice B: weights selected from a set of hundreds of options; called n times in a loop per invocation; new variants are constantly added

Taken on its own, this model seems like a good candidate for Triton BLS: each slice becomes its own Triton model, and the slice B model is called n times in a loop from the BLS pipeline's model.py.
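To make the shape of the pipeline concrete, here is a minimal sketch of the kind of BLS model.py I have in mind. The model names (slice_a, slice_b_<variant>), tensor names, and the way n and the variant id are passed in are placeholders for illustration, not my actual configuration; in the per-variant split described further down, the slice B model name would instead be fixed inside each pipeline.

```python
# model.py for the BLS orchestrator (Triton Python backend).
# Names like "slice_a", "slice_b_<variant>", "INPUT", "OUTPUT" are placeholders.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT")
            n = int(pb_utils.get_input_tensor_by_name(request, "N_STEPS").as_numpy()[0])
            variant = (
                pb_utils.get_input_tensor_by_name(request, "VARIANT_ID")
                .as_numpy()[0]
                .decode()
            )

            # Slice A: called exactly once per invocation.
            a_request = pb_utils.InferenceRequest(
                model_name="slice_a",
                requested_output_names=["OUTPUT"],
                inputs=[data],
            )
            a_response = a_request.exec()
            if a_response.has_error():
                raise pb_utils.TritonModelException(a_response.error().message())
            intermediate = pb_utils.get_output_tensor_by_name(a_response, "OUTPUT")

            # Slice B: one of hundreds of variants, called n times in a loop.
            for _ in range(n):
                b_request = pb_utils.InferenceRequest(
                    model_name=f"slice_b_{variant}",
                    requested_output_names=["OUTPUT"],
                    inputs=[pb_utils.Tensor("INPUT", intermediate.as_numpy())],
                )
                b_response = b_request.exec()
                if b_response.has_error():
                    raise pb_utils.TritonModelException(b_response.error().message())
                intermediate = pb_utils.get_output_tensor_by_name(b_response, "OUTPUT")

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT", intermediate.as_numpy())]
                )
            )
        return responses
```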

My first thought was to add all the model variants to a single BLS model repository, but I am not sure whether this would work with auto-scaling and with new models being added frequently.
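Concretely, in that single-repository option I am picturing a layout roughly like the one below, with one Python-backend BLS orchestrator plus one model directory per slice A/B variant (all directory names, and the use of TensorRT .plan files, are placeholders):

```
model_repository/
├── bls_pipeline/               # Python backend orchestrator (model.py sketched above)
│   ├── config.pbtxt
│   └── 1/model.py
├── slice_a_v1/ … slice_a_v5/   # the ~5 fixed slice A variants
├── slice_b_0001/               # hundreds of slice B variants
│   ├── config.pbtxt
│   └── 1/model.plan
├── slice_b_0002/
└── …                           # new slice_b_* directories added over time
```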

The other possibility is to split it into m BLS pipelines, where m is the number of variants I have for model slice B. This should work with auto-scaling, since each MME target model would then match the model requested at invocation, but I am not sure whether this is supported (i.e., an entire BLS model hierarchy being loaded and unloaded by SageMaker).
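Under this split, each slice B variant would get its own MME artifact (a tarball containing a copy of the BLS pipeline, slice A, and that one slice B model), and requests would be routed with TargetModel. Here is a rough sketch of how I would expect to invoke it; the endpoint name, the variant_0042.tar.gz naming convention, and the payload format are placeholders, and I would still need to confirm the exact content type the SageMaker Triton container expects:

```python
# Rough sketch of invoking one per-variant BLS pipeline on an MME.
# Endpoint name, artifact naming, and payload format are assumptions.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# KServe v2-style inference request for the BLS orchestrator's inputs.
payload = {
    "inputs": [
        {"name": "INPUT", "shape": [1, 16], "datatype": "FP32", "data": [0.0] * 16},
        {"name": "N_STEPS", "shape": [1], "datatype": "INT32", "data": [4]},
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="triton-bls-mme-endpoint",   # placeholder endpoint name
    ContentType="application/octet-stream",   # assumed; depends on the Triton container
    TargetModel="variant_0042.tar.gz",        # one artifact per slice B variant
    Body=json.dumps(payload),
)
print(response["Body"].read().decode())
```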

I guess the next best option is to not use MME and maybe switch to MCE instead, but that seems like a big loss in performance.

What is the best way to deploy this model using SageMaker tools?
