How to keep a SageMaker inference endpoint warm


Calling a SageMaker inference endpoint frequently (3-5 calls per minute) reduces the runtime duration from ~200 ms to ~50 ms, so there appears to be warm-up behaviour similar to Lambda's. Do you have any suggestions for keeping SageMaker inference consistently fast?

Asked a year ago · 1,028 views
1 Answer

You may need to check where this acceleration comes from to determine the right warm-up process. In CloudWatch metrics, you have ModelLatency and OverheadLatency.
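
As a starting point, here is a minimal sketch, assuming boto3 and a hypothetical endpoint named `my-endpoint`, of pulling both metrics from CloudWatch to see which one your warm-up requests are actually reducing:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def average_latency(metric_name: str, endpoint_name: str, variant: str = "AllTraffic"):
    """Return 5-minute averages (in microseconds) for a SageMaker latency metric."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda d: d["Timestamp"])

# "my-endpoint" is a placeholder; substitute your endpoint name.
for metric in ("ModelLatency", "OverheadLatency"):
    for point in average_latency(metric, "my-endpoint"):
        print(metric, point["Timestamp"], f'{point["Average"]:.0f} µs')
```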

A SageMaker endpoint has a front-end router that maintains caches for metadata and credentials. If requests are frequent enough, the cache is retained and automatically renewed, which reduces OverheadLatency.

If you see a big drop in ModelLatency with warm-up requests, it may mean your algorithm container is configured to retain some temporary data for longer.

Normally, you could schedule a Lambda function that invokes the endpoint, combined with CloudWatch Alarms target-tracking the InvocationsPerInstance metric. This ensures you always maintain a certain invocation rate when the endpoint is idle, and the synthetic requests can settle down when real requests pick up; a sketch of such a Lambda follows.
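
A minimal sketch of the warm-up Lambda, assuming a hypothetical endpoint named `my-endpoint` and a lightweight JSON payload the model tolerates (both placeholders); you would trigger it with an EventBridge schedule such as `rate(1 minute)`:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "my-endpoint"                  # placeholder: your endpoint name
WARMUP_PAYLOAD = json.dumps({"warmup": True})  # placeholder: any payload your model accepts

def handler(event, context):
    """Send a single synthetic request so the endpoint's caches stay warm."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=WARMUP_PAYLOAD,
    )
    # Drain the response body so the request completes cleanly.
    response["Body"].read()
    return {"statusCode": 200}
```

Keep the payload cheap for the model to process, so the warm-up traffic does not distort your latency metrics or cost much to serve.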

The caveat with warm-up is that it interferes with the endpoint's normal auto-scaling: the endpoint may not scale down properly.

AWS
answered a year ago
