I have an inference endpoint that returns an HTTP streaming response, and I would like to load test it.
Does ModelLatency in the Inference Recommender metrics refer to the time to receive the first chunk, or the time to receive all chunks?
ModelLatency: cf. https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html
The following links may help you understand ModelLatency in more detail: https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/ and https://repost.aws/knowledge-center/sagemaker-endpoint-latency. In particular, note how ModelLatency and OverheadLatency are defined there.
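If you want to compare those metrics against what your client observes, a minimal client-side sketch like the one below can record both time-to-first-chunk and time-to-last-chunk during a load test. It uses boto3's `invoke_endpoint_with_response_stream`; the endpoint name and payload are placeholders you would replace with your own.

```python
import time
import boto3

runtime = boto3.client("sagemaker-runtime")

start = time.perf_counter()
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="my-streaming-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=b'{"inputs": "Hello"}',           # placeholder payload
)

# The Body is an EventStream; each PayloadPart event carries one chunk.
first_chunk_at = None
for event in response["Body"]:
    if "PayloadPart" in event and first_chunk_at is None:
        first_chunk_at = time.perf_counter()
end = time.perf_counter()

if first_chunk_at is not None:
    print(f"time to first chunk: {(first_chunk_at - start) * 1000:.1f} ms")
print(f"time to last chunk:  {(end - start) * 1000:.1f} ms")
```

Running this alongside the endpoint's CloudWatch metrics lets you see which of the two client-side timings the reported latency tracks more closely.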