I have an inference endpoint that returns an HTTP streaming response, and I would like to load test it.
Does ModelLatency in the Inference Recommender metrics refer to the time to receive the first chunk, or the time to receive all chunks?
ModelLatency
cf. https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html
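For context, this is roughly how I invoke the endpoint and measure client-side timings. It is only a minimal sketch using boto3; the endpoint name, content type, and payload are placeholders, and it assumes the endpoint was deployed with response streaming enabled:

```python
import json
import time

import boto3

# Placeholder values -- substitute your own endpoint and payload.
ENDPOINT_NAME = "my-streaming-endpoint"
PAYLOAD = {"inputs": "Hello, world"}

runtime = boto3.client("sagemaker-runtime")

start = time.perf_counter()
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(PAYLOAD),
)

time_to_first_chunk = None
for event in response["Body"]:  # EventStream of PayloadPart events
    if "PayloadPart" in event:
        if time_to_first_chunk is None:
            # Record latency until the first streamed chunk arrives.
            time_to_first_chunk = time.perf_counter() - start
        # Process event["PayloadPart"]["Bytes"] here.

total_time = time.perf_counter() - start
print(f"time to first chunk: {time_to_first_chunk:.3f}s, total: {total_time:.3f}s")
```

My question is which of these two client-side numbers (first chunk vs. full stream) ModelLatency corresponds to.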
The following links may help you understand ModelLatency in more detail: https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/ and https://repost.aws/knowledge-center/sagemaker-endpoint-latency. In particular, note how ModelLatency and OverheadLatency are defined.
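If it helps, you can also compare your client-side measurements against the ModelLatency and OverheadLatency metrics the endpoint emits to CloudWatch (namespace AWS/SageMaker, values in microseconds). A minimal sketch; the endpoint name and variant name below are placeholders:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Placeholder values -- substitute your endpoint and production variant.
ENDPOINT_NAME = "my-streaming-endpoint"
VARIANT_NAME = "AllTraffic"

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

for metric in ("ModelLatency", "OverheadLatency"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": ENDPOINT_NAME},
            {"Name": "VariantName", "Value": VARIANT_NAME},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        # Values are reported in microseconds.
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```

Comparing these server-side numbers with what your load-test client observes for first-chunk and full-stream times should make it clear which interval ModelLatency covers for your endpoint.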