SageMaker with multiple models


A customer wants to host multiple DNN models in the same SageMaker container due to latency concerns. The customer does not want to spin up a separate container for each model because the extra network hops would add latency. The customer therefore asked me the question below:

Can one SageMaker endpoint host more than one model, where each model shares the same input and the different outputs are concatenated together?

I answered as follows:

Yes. Amazon SageMaker supports hosting multiple models in several different ways:

  1. Using multi-model endpoints: Amazon SageMaker supports serving multiple models from the same inference endpoint. Details can be found in [1] and sample code in [2]. Currently, this feature does not support Elastic Inference or serial inference pipelines. Multi-model endpoints enable time-sharing of memory resources across your models, which works best when the models are fairly similar in size and invocation latency; in that case, a multi-model endpoint can use its instances effectively across all models. If some models have significantly higher transactions-per-second (TPS) or stricter latency requirements, we recommend hosting them on dedicated endpoints. Multi-model endpoints are also well suited to scenarios that can tolerate the occasional cold-start latency penalty incurred when an infrequently used model is invoked. See the invocation sketch below this list.

  2. Bringing your own algorithm on SageMaker: You can also bring your own container with your own libraries and runtime/programming language for serving and training. See the example notebook in [3] for how to bring your own algorithm/container image to SageMaker. A serving-container sketch that concatenates two models' outputs also follows this list.

  3. Using a multi-model serving container with a multi-model archive file: You can find a sample for TensorFlow Serving in [4].

  4. Using inference pipelines: if the models are called sequentially, a SageMaker inference pipeline lets you chain up to 5 models, invoked one after the other on the same endpoint [5].

Beyond that, SageMaker endpoints include optimizations that help save costs, such as (1) one-click deployment to pre-configured environments for popular ML frameworks with a managed serving stack, (2) autoscaling, (3) model compilation, (4) cost-effective hardware acceleration via Elastic Inference, (5) multi-variant model deployment for testing and overlapped model replacement, and (6) a multi-AZ backend. Hosting multiple models on the same endpoint is not necessarily a good idea unless you have the reasons and requirements mentioned in option 1 above. Having one model per endpoint creates isolation, which benefits fault tolerance, security, and scalability. Please keep in mind that SageMaker runs containers on top of EC2 instances.
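
To illustrate option 1, here is a minimal sketch, assuming a multi-model endpoint named my-multi-model-endpoint has already been created with several model archives under its S3 model prefix; the endpoint name, archive names, and payload are hypothetical. The TargetModel parameter of invoke_endpoint selects which model serves each request, and the client concatenates the per-model outputs.

```python
import json
import boto3

# Assumption: a multi-model endpoint with this (hypothetical) name already exists.
ENDPOINT_NAME = "my-multi-model-endpoint"

runtime = boto3.client("sagemaker-runtime")

# The same input payload is sent to each model.
payload = json.dumps({"instances": [[0.1, 0.2, 0.3]]})

# TargetModel picks which archive under the endpoint's S3 prefix handles the request.
# SageMaker loads an archive on demand, so the first call to a rarely used model
# may see a cold-start latency penalty.
outputs = {}
for model_archive in ["model_a.tar.gz", "model_b.tar.gz"]:  # hypothetical archive names
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        TargetModel=model_archive,
        Body=payload,
    )
    outputs[model_archive] = json.loads(response["Body"].read())

# Client-side concatenation of the per-model outputs, as asked in the question.
print(outputs)
```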
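For option 2 (or option 3 with a custom serving stack), the container only needs to honor SageMaker's /ping and /invocations contract on port 8080, which makes server-side concatenation of several models' outputs straightforward. This is a rough sketch only; the model file names and the load_model/predict stubs are placeholders for whatever DNN framework the customer actually uses.

```python
# Minimal sketch of a bring-your-own-container inference server (Flask),
# implementing SageMaker's standard contract: GET /ping and POST /invocations on port 8080.
import json
import flask

app = flask.Flask(__name__)

MODEL_DIR = "/opt/ml/model"  # SageMaker mounts the model artifacts here


def load_model(path):
    # Placeholder: in a real container, deserialize the DNN saved at `path`
    # with the framework in use (PyTorch, TensorFlow, etc.).
    class _Stub:
        def predict(self, instances):
            return [[0.0] * 3 for _ in instances]
    return _Stub()


# Hypothetical file names; both models are loaded into the same container.
model_a = load_model(f"{MODEL_DIR}/model_a")
model_b = load_model(f"{MODEL_DIR}/model_b")


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 once the models are loaded.
    return flask.Response(response="\n", status=200, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(flask.request.data)
    # Both models see the same input; their outputs are concatenated in one response.
    result = {
        "model_a": model_a.predict(payload["instances"]),
        "model_b": model_b.predict(payload["instances"]),
    }
    return flask.Response(response=json.dumps(result), status=200, mimetype="application/json")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```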

[1] https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/

[2] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.ipynb

[3] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

[4] https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint

[5] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html

Am I missing anything? Are there any other approaches I should suggest?

AWS
Asked 4 years ago · 1,834 views
1 Answer
Accepted Answer

"Customer does not want to spin up different containers for each model due to the network adding additional latency."

I am assuming this is a pipeline scenario where different models need to be chained. If so, it's important to keep in mind that all containers in the pipeline run on the same EC2 instance, so that "inferences run with low latency because the containers are co-located on the same EC2 instances." [1] A minimal deployment sketch follows below.
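
As a rough sketch of how that co-location is set up with the SageMaker Python SDK (the image URIs, S3 paths, role ARN, names, and instance type below are all placeholders, not working values):

```python
# Minimal sketch: deploy two models as one inference pipeline on a single endpoint.
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Placeholder container images and model artifacts for the two chained models.
model_a = Model(
    image_uri="<ecr-image-for-model-a>",
    model_data="s3://my-bucket/model_a.tar.gz",
    role=role,
)
model_b = Model(
    image_uri="<ecr-image-for-model-b>",
    model_data="s3://my-bucket/model_b.tar.gz",
    role=role,
)

# The models are invoked in order: the response of model_a becomes the request of model_b.
pipeline_model = PipelineModel(
    name="my-inference-pipeline",  # hypothetical name
    role=role,
    models=[model_a, model_b],
)

# All containers in the pipeline run on the same EC2 instances behind one endpoint,
# so the hop between models stays on-box and latency stays low.
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-pipeline-endpoint",
)
```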

Hope this is useful.
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html

AWS
Moderator
Answered 4 years ago
