SageMaker with multiple models


A customer wants to host multiple DNN models in the same SageMaker container due to latency concerns. The customer does not want to spin up a separate container for each model because the extra network hops would add latency. The customer therefore asked me the question below:

Can one SageMaker endpoint host more than one model, where each model shares the same input and the different outputs are concatenated together?

I answered as follows:

Yes. Amazon SageMaker supports hosting multiple models in several different ways:

  1. Using multi-model endpoints: Amazon SageMaker supports serving multiple models from the same inference endpoint. Details can be found in [1] and sample code in [2]. Currently, this feature does not support Elastic Inference or serial inference pipelines. Multi-model endpoints enable time-sharing of memory resources across your models, which works best when the models are fairly similar in size and invocation latency; in that case, a multi-model endpoint can use its instances effectively across all models. If some models have significantly higher transactions-per-second (TPS) or stricter latency requirements, we recommend hosting them on dedicated endpoints. Multi-model endpoints are also well suited to scenarios that can tolerate the occasional cold-start latency penalty incurred when an infrequently used model is invoked. See the invocation sketch below this list.

  2. Bringing your own algorithm on SageMaker: You can also bring your own container with your own libraries and runtime/programming language for serving and training. See the example notebook in [3] for how to bring your own algorithm/container image to SageMaker. A serving-container sketch that concatenates two models' outputs also follows this list.

  3. Using a multi-model serving container with a multi-model archive file: You can find a sample for TensorFlow Serving in [4].

  4. Using inference pipelines: if the models are called sequentially, a SageMaker inference pipeline lets you chain up to 5 models, invoked one after the other on the same endpoint [5].

Beyond that, SageMaker endpoints include optimizations that help save costs, such as (1) one-click deployment to pre-configured environments for popular ML frameworks with a managed serving stack, (2) autoscaling, (3) model compilation, (4) cost-effective hardware acceleration via Elastic Inference, (5) multi-variant model deployment for testing and overlapped model replacement, and (6) a multi-AZ backend. Hosting multiple models on the same endpoint is not necessarily a good idea unless you have the reasons and requirements mentioned in option 1 above. Having one model per endpoint creates isolation, which benefits fault tolerance, security, and scalability. Please keep in mind that SageMaker runs containers on top of EC2 instances.
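
To illustrate option 1, here is a minimal sketch, assuming a multi-model endpoint named my-multi-model-endpoint has already been created with several model archives under its S3 model prefix; the endpoint name, archive names, and payload are hypothetical. The TargetModel parameter of invoke_endpoint selects which model serves each request, and the client concatenates the per-model outputs.

```python
import json
import boto3

# Assumption: a multi-model endpoint with this (hypothetical) name already exists.
ENDPOINT_NAME = "my-multi-model-endpoint"

runtime = boto3.client("sagemaker-runtime")

# The same input payload is sent to each model.
payload = json.dumps({"instances": [[0.1, 0.2, 0.3]]})

# TargetModel picks which archive under the endpoint's S3 prefix handles the request.
# SageMaker loads an archive on demand, so the first call to a rarely used model
# may see a cold-start latency penalty.
outputs = {}
for model_archive in ["model_a.tar.gz", "model_b.tar.gz"]:  # hypothetical archive names
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        TargetModel=model_archive,
        Body=payload,
    )
    outputs[model_archive] = json.loads(response["Body"].read())

# Client-side concatenation of the per-model outputs, as asked in the question.
print(outputs)
```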
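For option 2 (or option 3 with a custom serving stack), the container only needs to honor SageMaker's /ping and /invocations contract on port 8080, which makes server-side concatenation of several models' outputs straightforward. This is a rough sketch only; the model file names and the load_model/predict stubs are placeholders for whatever DNN framework the customer actually uses.

```python
# Minimal sketch of a bring-your-own-container inference server (Flask),
# implementing SageMaker's standard contract: GET /ping and POST /invocations on port 8080.
import json
import flask

app = flask.Flask(__name__)

MODEL_DIR = "/opt/ml/model"  # SageMaker mounts the model artifacts here


def load_model(path):
    # Placeholder: in a real container, deserialize the DNN saved at `path`
    # with the framework in use (PyTorch, TensorFlow, etc.).
    class _Stub:
        def predict(self, instances):
            return [[0.0] * 3 for _ in instances]
    return _Stub()


# Hypothetical file names; both models are loaded into the same container.
model_a = load_model(f"{MODEL_DIR}/model_a")
model_b = load_model(f"{MODEL_DIR}/model_b")


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 once the models are loaded.
    return flask.Response(response="\n", status=200, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(flask.request.data)
    # Both models see the same input; their outputs are concatenated in one response.
    result = {
        "model_a": model_a.predict(payload["instances"]),
        "model_b": model_b.predict(payload["instances"]),
    }
    return flask.Response(response=json.dumps(result), status=200, mimetype="application/json")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```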

[1] https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/

[2] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.ipynb

[3] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

[4] https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint

[5] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html

Am I missing anything? Are there any other approaches I should suggest?

AWS
Asked 4 years ago · 1,834 views
1 Answer
Accepted Answer

"Customer does not want to spin up different containers for each model due to the network adding additional latency."

I am assuming this is a pipeline scenario where different models need to be chained. If so, it's important to keep in mind that all containers in the pipeline run on the same EC2 instance, so that "inferences run with low latency because the containers are co-located on the same EC2 instances." [1] A minimal deployment sketch follows below.
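
As a rough sketch of how that co-location is set up with the SageMaker Python SDK (the image URIs, S3 paths, role ARN, names, and instance type below are all placeholders, not working values):

```python
# Minimal sketch: deploy two models as one inference pipeline on a single endpoint.
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Placeholder container images and model artifacts for the two chained models.
model_a = Model(
    image_uri="<ecr-image-for-model-a>",
    model_data="s3://my-bucket/model_a.tar.gz",
    role=role,
)
model_b = Model(
    image_uri="<ecr-image-for-model-b>",
    model_data="s3://my-bucket/model_b.tar.gz",
    role=role,
)

# The models are invoked in order: the response of model_a becomes the request of model_b.
pipeline_model = PipelineModel(
    name="my-inference-pipeline",  # hypothetical name
    role=role,
    models=[model_a, model_b],
)

# All containers in the pipeline run on the same EC2 instances behind one endpoint,
# so the hop between models stays on-box and latency stays low.
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-pipeline-endpoint",
)
```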

Hope this is useful.
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html

AWS
Moderator
Answered 4 years ago
