What factors affect SageMaker endpoint response time?


I'm using SageMaker as part of a planned deployment of an XGBoost model to production, where it will be called by a customer-facing mobile app (via another back-end service that we also host in AWS).

I would like to understand how to improve response times. I have tested the response time of my model both when it runs locally on my own dev machine and when it is running in SageMaker.

The wall time for local atomic predictions is about 1 ms at p50 and 7 ms at p99.

The wall time for atomic predictions against the endpoint (using the Python client SDK in a SageMaker notebook) is about 20 ms at p50 and 25 ms at p99. However, there are outliers that take as long as ~300 ms.
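For context, the SageMaker numbers above come from repeatedly invoking the endpoint and computing percentiles over the wall-clock times, roughly along these lines (the endpoint name and CSV payload below are placeholders, not my real ones):

```python
import time
import boto3
import numpy as np

# Placeholder endpoint name and feature vector -- substitute your own.
ENDPOINT_NAME = "xgboost-prod-endpoint"
PAYLOAD = "1.0,2.0,3.0,4.0"

runtime = boto3.client("sagemaker-runtime")

latencies_ms = []
for _ in range(500):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=PAYLOAD,
    )
    # Record end-to-end wall time for a single prediction, in milliseconds.
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(latencies_ms, 50):.1f} ms")
print(f"p99: {np.percentile(latencies_ms, 99):.1f} ms")
print(f"max: {max(latencies_ms):.1f} ms")
```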

I am curious to know what factors affect the performance of SageMaker calls (other than the complexity of the model itself), and I would be very grateful for any tips to bring our outliers down (preferably to around 50 ms if possible).

Asked 5 years ago · 1,628 views
2 Answers

Hi bradmitchell,
SageMaker Endpoints are a managed hosting solution with layered routing internally. You can gain some additional insight into the system with the OverheadLatency metric. You will also see the ModelLatency metric, which shows the time taken by the customer model itself -- https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html.

What kind of TPS are you driving against your Endpoint? (In a low-request-rate situation, it is possible that the caches on our side are not kept warm, and you might observe increased latencies.) I believe the above metrics will give you more detail.

Thank you,
Arun

AWS
answered 5 years ago

Hi Arun,

Thank you for suggesting the CloudWatch metrics. I just checked OverheadLatency, and it lines up pretty closely with the average timings I got in a SageMaker Jupyter notebook using the Python client SDK to invoke the endpoint.

Currently, I am expecting about 2.5 transactions per second during peak usage and 0.5 TPS during low traffic hours. The peak number will hopefully go up to around 10 TPS later this year.

The cache explanation makes a lot of sense. I've experimented with a few different TPS levels. At higher TPS there is consistently some spiking of timings at the start, but they then level off to around 20 ms. At lower TPS the timings remain a little unpredictable.
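For the low-traffic hours I'm considering a simple keep-warm ping so the endpoint never sits idle for long stretches. A rough sketch of what I have in mind is below; the endpoint name and payload are placeholders, and in practice this would probably run as a scheduled Lambda rather than a bare loop:

```python
import time
import boto3

# Placeholder keep-warm loop: one lightweight synthetic request every
# 30 seconds during quiet hours, so routing caches stay warm.
ENDPOINT_NAME = "xgboost-prod-endpoint"   # placeholder name
WARMUP_PAYLOAD = "0.0,0.0,0.0,0.0"        # any valid dummy feature vector

runtime = boto3.client("sagemaker-runtime")

while True:
    try:
        runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="text/csv",
            Body=WARMUP_PAYLOAD,
        )
    except Exception as exc:              # don't let a transient error stop the pinger
        print(f"warm-up request failed: {exc}")
    time.sleep(30)
```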

Thank you again for your help!

Brad

answered 5 years ago
