Can I host multiple models using NVIDIA Triton Business Logic Scripting behind a SageMaker Multi-Model Endpoint?


I'd like to host models that require Business Logic Scripting (BLS) on a GPU-backed real-time SageMaker Multi-Model Endpoint (MME), and to scale to hundreds or thousands of such models behind one endpoint. Will this work out of the box? I know this is supported for Triton models that don't use BLS, but the documentation is not clear on whether an entire BLS pipeline can be treated as an individual model by MME's model loading and auto-scaling logic.

A more detailed description of my use case:

The model consists of:

  • Model slice A: weights selected from a set of ~5 options; called once per invocation; variants are never added or removed
  • Model slice B: weights selected from a set of hundreds of options; called in a loop n times per invocation; new variants are added constantly

Taken on its own, this model seems like a good candidate for Triton BLS, where each slice is its own model in the Triton repository and the slice B model is called n times in a loop from the pipeline's model.py (rough sketch below).
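
For context, here is roughly what I imagine the BLS pipeline's model.py looking like. The model names ("slice_a", "slice_b_variant"), tensor names ("INPUT", "OUTPUT"), and the fixed loop count are placeholders for illustration, not my real configuration:

```python
# model.py -- minimal BLS sketch; names and shapes are placeholders
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")

            # Call model slice A once per invocation.
            a_request = pb_utils.InferenceRequest(
                model_name="slice_a",
                requested_output_names=["OUTPUT"],
                inputs=[input_tensor],
            )
            a_response = a_request.exec()
            if a_response.has_error():
                raise pb_utils.TritonModelException(a_response.error().message())
            current = pb_utils.get_output_tensor_by_name(a_response, "OUTPUT")

            # Call model slice B n times in a loop, feeding the result back in.
            n = 4  # placeholder; in practice n would come from the request
            for _ in range(n):
                b_request = pb_utils.InferenceRequest(
                    model_name="slice_b_variant",
                    requested_output_names=["OUTPUT"],
                    inputs=[pb_utils.Tensor("INPUT", current.as_numpy())],
                )
                b_response = b_request.exec()
                if b_response.has_error():
                    raise pb_utils.TritonModelException(b_response.error().message())
                current = pb_utils.get_output_tensor_by_name(b_response, "OUTPUT")

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT", current.as_numpy())]
                )
            )
        return responses
```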

My first thought was to add all of the model variants to a single BLS model repository, but I am not sure whether that would work with auto-scaling and with new models being added frequently.

The other possibility is to split it into m BLS pipelines, where m is the number of variants I have for model slice B. This should work with auto-scaling, since each request's target model would map one-to-one to a pipeline (see the sketch below), but I am not sure whether this is supported, i.e. whether an entire BLS model hierarchy can be loaded and unloaded as a unit in SageMaker.
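
To illustrate what I mean by the target matching the destination model, this is roughly how I would expect to invoke one of the m packaged pipelines on the MME with boto3. The endpoint name, artifact name, and payload shape are placeholders, and each .tar.gz would contain a complete Triton repository (the BLS pipeline plus slice A and that one slice B variant):

```python
# Minimal sketch of routing a request to one BLS pipeline variant on an MME.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": [
        {"name": "INPUT", "shape": [1, 16], "datatype": "FP32",
         "data": [0.0] * 16}
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="my-triton-mme-endpoint",       # placeholder endpoint name
    TargetModel="slice-b-variant-042.tar.gz",    # selects which pipeline MME loads/runs
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read())
```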

The next best option, I guess, is to not use MME at all and maybe switch to a multi-container endpoint (MCE)? But that seems like a big loss in performance.

What is the best way to deploy this model using SageMaker tools?
