I have an inference endpoint that returns an HTTP streaming response, and I would like to load test it.
Does ModelLatency in the recommender metrics refer to time to receive the first chunk, or time to receive all chunks?
ModelLatency
cf. https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html
The following links may help you understand ModelLatency in more detail: https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/ and https://repost.aws/knowledge-center/sagemaker-endpoint-latency. In particular, note how ModelLatency and OverheadLatency are defined.
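To see the difference between time-to-first-chunk and time-to-last-chunk for yourself, you can measure both on the client side when invoking a streaming endpoint. Below is a minimal sketch using boto3's `invoke_endpoint_with_response_stream`; the endpoint name and payload are placeholders, and the timings it reports are client-side observations, not the server-side ModelLatency metric itself, so use it only to compare the two notions of latency for your own endpoint.

```python
import time

import boto3

# Placeholder endpoint name and payload -- replace with your own.
ENDPOINT_NAME = "my-streaming-endpoint"
PAYLOAD = b'{"inputs": "Hello, world"}'

runtime = boto3.client("sagemaker-runtime")

start = time.perf_counter()
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=PAYLOAD,
)

first_chunk_latency = None
for event in response["Body"]:
    # Each streamed chunk arrives as event["PayloadPart"]["Bytes"].
    if "PayloadPart" in event and first_chunk_latency is None:
        # Client-side time to the first chunk.
        first_chunk_latency = time.perf_counter() - start

# Client-side time until the stream has been fully consumed.
total_latency = time.perf_counter() - start

print(f"time to first chunk: {first_chunk_latency:.3f}s")
print(f"time to all chunks:  {total_latency:.3f}s")
```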