SageMaker endpoints are a managed hosting solution with layered routing internally. You can gain some additional insight into the system from the OverheadLatency metric. You will also see the ModelLatency metric, which shows the time taken by the customer model itself: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html.
What kind of TPS are you driving against your endpoint? In a low request rate situation, it is possible that the caches on our side are not kept warm, and you might observe increased latencies. I believe the metrics above will give you more detail.
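For anyone who wants to pull those two metrics programmatically rather than through the console, here is a minimal sketch using boto3's CloudWatch `get_metric_statistics` call. The helper names (`latency_query`, `to_ms`, `fetch_latencies`), the one-hour window, and the `AllTraffic` variant name are my own assumptions for illustration; substitute your endpoint and variant names. Both metrics are reported in microseconds, so the sketch converts them to milliseconds.

```python
from datetime import datetime, timedelta, timezone


def latency_query(metric_name, endpoint_name, variant_name="AllTraffic",
                  minutes=60, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    variant_name defaults to "AllTraffic" as an assumption; use the
    variant name configured on your own endpoint.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }


def to_ms(datapoints):
    """Sort datapoints by time and convert microseconds to milliseconds."""
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    return [(d["Timestamp"], d["Average"] / 1000.0, d["Maximum"] / 1000.0)
            for d in ordered]


def fetch_latencies(endpoint_name, variant_name="AllTraffic"):
    """Fetch OverheadLatency and ModelLatency for one endpoint variant."""
    import boto3  # imported lazily so the helpers above work without AWS access

    cloudwatch = boto3.client("cloudwatch")
    results = {}
    for metric in ("OverheadLatency", "ModelLatency"):
        response = cloudwatch.get_metric_statistics(
            **latency_query(metric, endpoint_name, variant_name))
        results[metric] = to_ms(response["Datapoints"])
    return results
```

Calling `fetch_latencies("my-endpoint")` (a placeholder name) returns, per metric, a list of `(timestamp, average_ms, maximum_ms)` tuples that you can compare against your client-side timings.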
Thank you for suggesting the CloudWatch metrics. I just checked OverheadLatency, and it lines up pretty closely with the average timings I got in a SageMaker Jupyter notebook using the Python client SDK to invoke the endpoint.
Currently, I am expecting about 2.5 transactions per second during peak usage and 0.5 TPS during low traffic hours. The peak number will hopefully go up to around 10 TPS later this year.
The cache explanation makes a lot of sense. I've experimented with a few different TPS settings. Fairly consistently, the timings spike at the start and then level off to around 20 ms at higher TPS. In the lower-TPS experiments, I've noticed that the timings remain a little unpredictable.
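In case it helps anyone repeat this kind of experiment, here is a rough sketch of a paced timing loop. It is not the exact script I ran: `measure`, `summarize`, and the `invoke` callable are hypothetical names, and `invoke` is assumed to wrap whatever call you want to time, for example `invoke_endpoint` on a boto3 `sagemaker-runtime` client. Requests are sent one at a time, so if a single call takes longer than `1 / tps` seconds the achieved rate will fall below the target.

```python
import statistics
import time


def measure(invoke, tps, n_requests, warmup=0):
    """Call invoke() at roughly `tps` requests per second and time each call.

    Returns per-request latencies in milliseconds, dropping the first
    `warmup` requests so the initial spike can be excluded.
    """
    interval = 1.0 / tps
    timings_ms = []
    next_start = time.perf_counter()
    for _ in range(n_requests):
        # Sleep until the scheduled start time of this request.
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        started = time.perf_counter()
        invoke()
        timings_ms.append((time.perf_counter() - started) * 1000.0)
        next_start += interval
    return timings_ms[warmup:]


def summarize(timings_ms):
    """Report the count, median, and maximum of the measured latencies."""
    ordered = sorted(timings_ms)
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "max_ms": ordered[-1],
    }
```

Running `summarize(measure(call_endpoint, tps=0.5, n_requests=50, warmup=5))`, where `call_endpoint` is your own wrapper, and then repeating at a higher `tps` makes it easy to compare the low-traffic and high-traffic cases side by side.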
Thank you again for your help!
- asked a year ago