SageMaker with multiple models


A customer wants to host multiple DNN models in the same SageMaker container because of latency concerns. The customer does not want to spin up a separate container for each model, since the extra network hops would add latency. My customer therefore asked me the question below:

Can one SageMaker endpoint host more than one model, where each model receives the same input and the different outputs are concatenated together?

I answered as follows:

Yes. Amazon SageMaker supports hosting multiple models in several different ways:

  1. Using multi-model endpoints: Amazon SageMaker supports serving multiple models from the same inference endpoint. Details can be found in [1] and sample code in [2]. Currently, this feature does not support Elastic Inference or serial inference pipelines. Multi-model endpoints enable time-sharing of memory resources across your models, which works best when the models are fairly similar in size and invocation latency; in that case a multi-model endpoint can use its instances effectively across all models. If some models have significantly higher transactions-per-second (TPS) or stricter latency requirements, we recommend hosting them on dedicated endpoints. Multi-model endpoints are also well suited to scenarios that can tolerate the occasional cold-start latency penalty incurred when an infrequently used model is invoked. (A minimal invocation sketch appears after this list.)

  2. Bringing your own algorithm on SageMaker: you can also bring your own container with your own libraries and serving/training runtime or programming language. See the example notebook in [3] for how to bring your own algorithm/container image to SageMaker.

  3. Using a multi-model serving container with a multi-model archive file: a TensorFlow Serving example can be found in [4].

  4. If the models are called sequentially, a SageMaker inference pipeline lets you chain up to five models, invoked one after the other on the same endpoint [5].

Keep in mind that SageMaker endpoints already include optimizations that save cost, such as (1) one-click deployment to pre-configured environments for popular ML frameworks with a managed serving stack, (2) autoscaling, (3) model compilation, (4) cost-effective hardware acceleration via Elastic Inference, (5) multi-variant model deployment for testing and overlapped model replacement, and (6) a multi-AZ backend. It is not necessarily a good idea to put multiple models on the same endpoint (unless you have the reasons and requirements mentioned in option 1 above). Having one model per endpoint creates isolation, which benefits fault tolerance, security, and scalability. Also keep in mind that SageMaker runs its containers on top of EC2.
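To illustrate option 1, below is a minimal sketch, assuming a multi-model endpoint is already deployed, of sending the same payload to two models behind one endpoint and concatenating their outputs on the client side. The endpoint name and the model artifact names are hypothetical placeholders.

```python
import json
import boto3

# Hypothetical names: replace with your own endpoint and model artifact names.
ENDPOINT_NAME = "my-multi-model-endpoint"
MODEL_ARCHIVES = ["model_a.tar.gz", "model_b.tar.gz"]

runtime = boto3.client("sagemaker-runtime")


def predict_all(payload):
    """Send the same payload to every model behind the endpoint and
    concatenate the individual predictions into one list."""
    combined = []
    for archive in MODEL_ARCHIVES:
        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            TargetModel=archive,  # selects which model handles this request
            Body=json.dumps(payload),
        )
        combined.append(json.loads(response["Body"].read()))
    return combined


# Example call with a dummy feature vector.
print(predict_all({"instances": [[0.1, 0.2, 0.3]]}))
```

Note that this issues one request per model. If the outputs must be produced and concatenated inside a single request, a bring-your-own container (option 2) that loads both models in one serving process is the closer fit.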

[1] https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/

[2] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.ipynb

[3] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

[4] https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint

[5] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html

Am I missing anything? Are there any other approaches I should suggest?

Accepted Answer

"The customer does not want to spin up a separate container for each model, since the extra network hops would add latency."

I am assuming this is a pipeline scenario where different models need to be chained. If so, keep in mind that all containers in a pipeline run on the same EC2 instance, so that "inferences run with low latency because the containers are co-located on the same EC2 instances." [1]
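For reference, such a chain can be set up with the SageMaker Python SDK's PipelineModel. Below is a minimal sketch, assuming SageMaker Python SDK v2; the role ARN, ECR image URIs, S3 paths, and names are hypothetical placeholders.

```python
from sagemaker import Session
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Hypothetical container images and model artifacts -- replace with your own.
model_a = Model(
    image_uri="<ecr-image-for-model-a>",
    model_data="s3://my-bucket/model_a.tar.gz",
    role=role,
    sagemaker_session=session,
)
model_b = Model(
    image_uri="<ecr-image-for-model-b>",
    model_data="s3://my-bucket/model_b.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Chain the two models. All containers in the pipeline are co-located on the
# same EC2 instance(s), so requests flow between them without a network hop
# to another instance.
pipeline_model = PipelineModel(
    name="my-inference-pipeline",
    role=role,
    models=[model_a, model_b],
    sagemaker_session=session,
)

predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```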

Hope this is useful.
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
