SageMaker Inference Recommendation

We have an application that processes customer survey responses to determine their overall sentiment (negative, neutral, or positive). We are leveraging SageMaker for this sentiment analysis.

Below are some key data points about our current usage:

  1. We have 4 real-time endpoints, each with the following properties:
    1. Multi-model endpoint hosting two models:
      1. Model A is 6 GB
      2. Model B is 9.4 GB
    2. Runtime configuration: ml.c5.2xlarge (8 vCPU, 16 GiB memory)
    3. The container image that handles inference is 4.5 GB
  2. The reason we have 4 endpoints is so that we can make concurrent requests to SageMaker. To do this, we have our own load-balancing logic that determines which endpoint to call (sketched below).
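For reference, our current invocation path looks roughly like this (a simplified sketch; the endpoint and model names are illustrative, and our real routing logic is smarter than a random pick):

```python
import random

import boto3

runtime = boto3.client("sagemaker-runtime")

# Our 4 real-time multi-model endpoints (names are made up for this sketch).
ENDPOINTS = [f"sentiment-endpoint-{i}" for i in range(1, 5)]

def classify_sentiment(payload: bytes, model: str = "model-a.tar.gz") -> bytes:
    """Route a survey response to one of our endpoints and one of the two models."""
    endpoint = random.choice(ENDPOINTS)  # stand-in for our load-balancing logic
    response = runtime.invoke_endpoint(
        EndpointName=endpoint,
        TargetModel=model,             # multi-model endpoints select the model here
        ContentType="application/json",
        Body=payload,
    )
    return response["Body"].read()
```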

This pattern will not continue to work for us in 2023 as we scale up our survey ingestion pipeline. From load-testing our system, the maximum TPS to SageMaker we can currently support is 5.333; anything beyond that causes SageMaker to return 5XX responses (because we max out the CPU on all available cores). Furthermore, our call pattern is batch-like, so there is no need to keep the endpoints running 24/7.

For 2023, we predict that we will need to handle a TPS of up to 200 during peaks. Before we start designing a new workflow to support this TPS, we wanted to get the SageMaker team's/community's feedback on alternatives. While we could try to scale this setup vertically and horizontally, we don't believe that deploying bigger hosts and creating more real-time endpoints is the right solution in terms of total cost (given our intermittent, batch-style call pattern to SageMaker). Specifically, we want to know which inference option would best fit our use case based on the description of the system above.

asked a year ago · 501 views
1 Answer

Hi User-8002955,

Have you tried the SageMaker managed autoscaling capability? Instead of having 4 endpoints and implementing the load balancing yourselves, SageMaker can apply an autoscaling policy and do the load balancing for you (at no additional cost). With an autoscaling policy, SageMaker will scale out your endpoint during peak hours (add instances) and scale in again when traffic is low (remove excess instances).
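As a rough sketch, an autoscaling policy is attached to an existing endpoint variant through the Application Auto Scaling API. The endpoint/variant names, capacity bounds, and target value below are illustrative assumptions, not recommendations:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Assumed endpoint name "sentiment-endpoint" with the default variant name.
resource_id = "endpoint/sentiment-endpoint/variant/AllTraffic"

# Tell Application Auto Scaling it may vary this variant's instance count.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Target tracking: keep each instance at ~70 invocations per minute.
autoscaling.put_scaling_policy(
    PolicyName="sentiment-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # illustrative; tune from your load-test numbers
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # seconds to wait after adding instances
        "ScaleInCooldown": 300,   # seconds to wait after removing instances
    },
)
```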

If your access pattern is purely batch (all the data you want to run inference on is available at the same time), you can use SageMaker Batch Transform: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html . If this fits your needs, it is probably the most cost-efficient option, since you are only charged for the seconds the transform job runs (not for endpoint availability time).
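A Batch Transform job is a single API call; here is a minimal sketch with boto3, where the model name, S3 paths, and instance settings are assumptions you would replace with your own:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_transform_job(
    TransformJobName="sentiment-batch-2023-01-15",  # must be unique per job
    ModelName="sentiment-model-a",  # one job per model; see the note below
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/surveys/input/",
            }
        },
        "ContentType": "application/json",
        "SplitType": "Line",  # treat each line of the input files as one record
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/surveys/output/"},
    TransformResources={
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 4,  # SageMaker shards the input across instances
    },
)
```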

The third option to look at is Asynchronous Inference endpoints: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html . Async endpoints are similar to real-time endpoints in that you pass in data for inference one request at a time, as the requests come in. You can also leverage the autoscaling capability mentioned above, with the addition that if there is no traffic at all for a given period, the endpoint can scale in to zero instances. When you make an inference request, the endpoint immediately acknowledges that it received the request, places it in a queue for processing, writes the result to the S3 location you defined, and can optionally notify you that the result is ready.
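A minimal sketch of creating an async endpoint configuration and invoking it (all names and S3 paths are illustrative assumptions):

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Endpoint config with an AsyncInferenceConfig block; the endpoint itself is
# then created from this config with create_endpoint (not shown).
sm.create_endpoint_config(
    EndpointConfigName="sentiment-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "sentiment-model-a",
        "InstanceType": "ml.c5.2xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-results/",
            # A NotificationConfig with SNS topics can be added here to get
            # success/error notifications instead of polling S3.
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

# Invoking: the payload must already be in S3. The call returns immediately
# with the S3 location where the result will eventually be written.
response = runtime.invoke_endpoint_async(
    EndpointName="sentiment-async",
    InputLocation="s3://my-bucket/async-inputs/survey-batch-001.json",
    ContentType="application/json",
)
print(response["OutputLocation"])
```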

Please note that the 2nd and 3rd options above do not support multi-model endpoints, so your application would need to decide which of the two models to use and either trigger a different Batch Transform job or call a different async endpoint.

Based on your application needs, all three of the options above can scale to cover 200 TPS or more; it all depends on the right access pattern for your use case. Hope the above helped.

AWS
answered a year ago
