Hello Quan Dang!
The following link covers SageMaker model deployment options and deployment recommendations: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-options
For your problem: each model's processing time is short, the request payload is not large, the latency requirement is essentially real time, and there are about 1,000 deep learning models, each ~2 GB in size. That eliminates the following options: asynchronous inference, serverless inference, and batch transform, leaving only one option: real-time inference. Within real-time inference, there are four options:
- Host a single model - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-single-model.html : the fastest way to get inferences from a single ML model.
- Host multiple models in one container behind one endpoint - https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html : Multi-model endpoints are ideal for hosting a large number of models that use the same ML framework on a shared serving container. If you have a mix of frequently and infrequently accessed models, a multi-model endpoint can serve this traffic with fewer resources and greater cost savings. Your application should tolerate the occasional cold-start latency penalty that occurs when invoking infrequently used models.
- Host multiple models that use different containers behind one endpoint - https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html : SageMaker multi-container endpoints let you deploy multiple containers that use different models or frameworks on a single SageMaker endpoint. The containers can run in sequence as an inference pipeline, or each container can be invoked directly to improve endpoint utilization and optimize costs.
- (Eliminated) Host models along with pre-processing logic as a serial inference pipeline behind one endpoint - https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
- Note: Whichever option you choose above, enable the auto scaling feature (a minimal sketch follows this list): https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
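As an illustration, here is a minimal sketch of attaching a target-tracking auto scaling policy to an endpoint variant using boto3 and Application Auto Scaling. The endpoint name, variant name, instance limits, and target value below are placeholder assumptions you would replace with your own:

```python
import boto3

# Assumed placeholder names; replace with your own endpoint and variant.
ENDPOINT_NAME = "my-realtime-endpoint"
VARIANT_NAME = "AllTraffic"

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance, a common starting metric.
autoscaling.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```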
So we narrow it down to only three deployment options. You can start with a survey of your ML models' deployment details (for each model, record its framework, inference latency, and GPU usage type).
For models that are frequently accessed (inference latency <~60 s), choose "Host a single model". Otherwise, for models that are not frequently accessed: if they use the same ML framework, choose "Host multiple models in one container behind one endpoint"; if they use different ML frameworks, choose "Host multiple models that use different containers behind one endpoint".
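If you go with the multi-model endpoint option, here is a minimal sketch of creating one and invoking a specific model with boto3. The S3 prefix, container image, role ARN, instance type, and model archive names are assumptions to be replaced with your own values:

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Assumed placeholders; substitute your own S3 prefix, serving image, and role.
MODEL_DATA_PREFIX = "s3://my-bucket/models/"  # holds model-001.tar.gz ... model-999.tar.gz
IMAGE_URI = "<your-framework-serving-container-uri>"
ROLE_ARN = "<your-sagemaker-execution-role-arn>"

# One SageMaker model definition in MultiModel mode serves every archive under the prefix.
sm.create_model(
    ModelName="shared-mme-model",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "ModelDataUrl": MODEL_DATA_PREFIX,
        "Mode": "MultiModel",
    },
)

sm.create_endpoint_config(
    EndpointConfigName="shared-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "shared-mme-model",
        "InstanceType": "ml.m5.2xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="shared-mme-endpoint",
    EndpointConfigName="shared-mme-config",
)

# Once the endpoint is InService, route each request to a specific model archive.
response = runtime.invoke_endpoint(
    EndpointName="shared-mme-endpoint",
    ContentType="application/json",
    TargetModel="model-042.tar.gz",  # relative to MODEL_DATA_PREFIX
    Body=b'{"inputs": [1.0, 2.0, 3.0]}',
)
print(response["Body"].read())
```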
Great point! Will do some experiments and let you know the result!