What factors affect SageMaker endpoint response time?


I'm using SageMaker as part of a planned deployment of an XGBoost model to production, where it will be called by a customer-facing mobile app (via another back-end service that we also host in AWS).

I would like to understand how to improve response times. I have tested my model's response time both when it runs locally on my own dev machine and when it's running in SageMaker.

The wall time for local atomic predictions is about 1 ms at p50 and 7 ms at p99.

The wall time for atomic predictions (using the Python client SDK in a SageMaker notebook) is about 20 ms at p50 and 25 ms at p99. However, there are outliers that take as long as ~300 ms.
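For reference, the numbers above come from a simple timing loop. A minimal sketch of such a benchmark is below; the `invoke_endpoint` wiring shown in the comment is illustrative only, with a placeholder endpoint name and payload:

```python
import statistics
import time

def summarize_latencies(samples_ms):
    """Compute p50/p99 from a list of per-call wall times in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98]}

def benchmark(invoke_fn, n=200):
    """Time n atomic calls to invoke_fn and summarize the distribution.

    In practice invoke_fn would wrap the SageMaker runtime client, e.g.:
        runtime = boto3.client("sagemaker-runtime")
        invoke_fn = lambda: runtime.invoke_endpoint(
            EndpointName="my-xgboost-endpoint",  # placeholder name
            ContentType="text/csv",
            Body=payload)
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        invoke_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize_latencies(samples)
```

Measuring a few hundred calls and looking at the tail (rather than the average) is what surfaces the ~300 ms outliers.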

I am curious to know what factors affect the performance of SageMaker calls (other than the complexity of the model itself), and I would be very grateful for any tips to bring our outliers down (preferably to around 50 ms if possible).

asked 5 years ago · 1797 views
2 Answers

Hi bradmitchell,
SageMaker Endpoints are a managed hosting solution with layered routing internally. You can gain additional insight into the system with the OverheadLatency metric. You will also see the ModelLatency metric, which shows the time taken by the customer model itself -- https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html.

What kind of TPS are you driving against your endpoint? (In a low-request-rate situation, it is possible that the caches on our side are not kept warm, and you might observe increased latencies.) I believe the above metrics will give you more detail.

Thank you,

answered 5 years ago

Hi Arun,

Thank you for suggesting the CloudWatch metrics. I just checked OverheadLatency, and it lines up pretty closely with the average timings I got in a SageMaker Jupyter notebook using the Python client SDK to invoke the endpoint.

Currently, I am expecting about 2.5 transactions per second during peak usage and 0.5 TPS during low traffic hours. The peak number will hopefully go up to around 10 TPS later this year.

The cache explanation makes a lot of sense. I've experimented with a few different TPS rates. There is pretty consistently some spiking in the timings to start, but then they level off to around 20 ms at higher TPS. In lower-TPS experiments, the timings remain somewhat unpredictable.
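For the low-traffic hours, one workaround I'm considering (not something suggested in the docs, just a sketch based on the cache-warmth explanation) is a background pinger that issues a cheap canary invocation on a fixed interval; the 30 s interval is a guess:

```python
import threading

def keep_warm(invoke_fn, interval_s, stop_event):
    """Call invoke_fn every interval_s seconds until stop_event is set.

    Intended to keep a low-traffic endpoint's request path warm; in practice
    invoke_fn would wrap invoke_endpoint with a small canary payload.
    """
    while not stop_event.is_set():
        invoke_fn()
        stop_event.wait(interval_s)

# Run in the background, pinging every 30 s (interval is a guess):
stop = threading.Event()
pinger = threading.Thread(target=keep_warm, args=(lambda: None, 30, stop),
                          daemon=True)
pinger.start()
```

Calling `stop.set()` shuts the pinger down cleanly; `stop_event.wait` doubles as an interruptible sleep so shutdown doesn't block for a full interval.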

Thank you again for your help!


answered 5 years ago
