I have an inference endpoint that returns an HTTP streaming response, and I would like to load test it.
Does ModelLatency in the recommender metrics refer to time to receive the first chunk, or time to receive all chunks?
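For reference, here is roughly how I am measuring things on the client side. This is a minimal sketch that assumes the endpoint supports response streaming via InvokeEndpointWithResponseStream; the endpoint name and payload are placeholders for my actual setup.

```python
import json
import time

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = json.dumps({"inputs": "example prompt"})  # placeholder request body

start = time.perf_counter()
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="my-streaming-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=payload,
)

first_chunk_latency = None
for event in response["Body"]:  # EventStream of PayloadPart events
    if "PayloadPart" in event:
        if first_chunk_latency is None:
            # Time until the first streamed chunk arrives at the client
            first_chunk_latency = time.perf_counter() - start
        _ = event["PayloadPart"]["Bytes"]  # consume the chunk

# Time until the stream is fully consumed
total_latency = time.perf_counter() - start
print(f"time to first chunk: {first_chunk_latency:.3f}s, "
      f"time to all chunks: {total_latency:.3f}s")
```

I would like to know which of these two client-side measurements ModelLatency corresponds to.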
ModelLatency is defined in the Inference Recommender results documentation; see https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html
The following links may help you understand ModelLatency in more detail: https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/ and https://repost.aws/knowledge-center/sagemaker-endpoint-latency. In particular, note how ModelLatency and OverheadLatency are defined.
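When comparing against client-side timings, it can help to pull the endpoint's metrics directly from CloudWatch. The sketch below is one way to do that, assuming a single production variant; the endpoint and variant names are placeholders, and both metrics are reported in microseconds.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)  # look back over the load-test window

for metric in ("ModelLatency", "OverheadLatency"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": "my-streaming-endpoint"},  # placeholder
            {"Name": "VariantName", "Value": "AllTraffic"},              # placeholder
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    # Values are in microseconds
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```

Comparing these values against your client-observed time-to-first-chunk and time-to-last-chunk should make it clear which interval ModelLatency covers for your endpoint.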