When I invoke my Amazon SageMaker endpoint, I experience high latency.
Use Amazon CloudWatch to monitor the latency metrics ModelLatency and OverheadLatency for a SageMaker endpoint that serves a single model.
- ModelLatency is the amount of time that the model takes to respond to an inference request, as viewed from SageMaker. This duration includes the local communication time to send the request to the model container and fetch the response from it, as well as the time to complete the inference inside the container.
- OverheadLatency is the amount of time that SageMaker adds to a request on top of the model's own processing. It's measured from when SageMaker receives a request until it returns a response, minus ModelLatency.
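The relationship between the two metrics can be sketched as simple arithmetic (the values below are hypothetical; CloudWatch reports both metrics in microseconds):

```python
def overhead_latency_us(total_request_us: int, model_latency_us: int) -> int:
    """OverheadLatency is the total time SageMaker spends on the request
    minus ModelLatency (both in microseconds, as CloudWatch reports them)."""
    return total_request_us - model_latency_us

# Hypothetical values: a 250 ms round trip where the model itself took 200 ms
print(overhead_latency_us(250_000, 200_000))  # prints 50000
```

In this example, 50,000 microseconds (50 ms) of the round trip is SageMaker overhead rather than model processing time.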
When you use a SageMaker multi-model endpoint, the following additional metrics are available in CloudWatch:
- ModelLoadingWaitTime: The amount of time that an invocation request waits for the target model to download or load, before performing inference.
- ModelDownloadingTime: The amount of time to download the model from Amazon Simple Storage Service (Amazon S3).
- ModelLoadingTime: The amount of time to load the model in the container.
- ModelCacheHit: The number of InvokeEndpoint requests sent to the endpoint where the target model was already loaded.
Multi-model endpoints load and unload models throughout their lifetime. You can use the LoadedModelCount CloudWatch metric to view the number of loaded models for an endpoint.
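As a sketch of how to pull LoadedModelCount from CloudWatch, the helper below builds the parameters for a `get_metric_statistics` call (the endpoint name is a placeholder, and the `VariantName` dimension is assumed to be the default `AllTraffic`):

```python
from datetime import datetime, timedelta, timezone

def loaded_model_count_query(endpoint_name: str) -> dict:
    """Build get_metric_statistics parameters for the LoadedModelCount
    metric of a multi-model endpoint over the last hour."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "LoadedModelCount",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},  # assumed variant name
        ],
        "StartTime": end - timedelta(hours=1),
        "EndTime": end,
        "Period": 300,  # 5-minute granularity
        "Statistics": ["Average", "Maximum"],
    }

# With boto3, you would pass this to CloudWatch, for example:
# boto3.client("cloudwatch").get_metric_statistics(**loaded_model_count_query("my-endpoint"))
```

Comparing the Average and Maximum statistics over time shows whether the endpoint is churning models in and out of memory.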
To reduce ModelLatency, take any of the following actions:
- Benchmark the model outside of a SageMaker endpoint to test performance.
- If SageMaker Neo supports your model, then you can compile the model. SageMaker Neo optimizes models to run up to twice as fast with less than a tenth of the memory footprint with no loss in accuracy.
- If AWS Inferentia supports your model, then you can compile the model for Inferentia. This offers up to three times higher throughput and up to 45% lower cost per inference compared to AWS GPU-based instances.
- If you use a CPU instance and the model supports GPU acceleration, then move the endpoint to a GPU instance.
Note: The inference code can affect ModelLatency depending on how it handles requests. Any delays in the code increase latency.
- An overloaded endpoint can cause higher model latency. To dynamically increase and decrease the number of instances available to an endpoint, add auto scaling to the endpoint.
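The first action above, benchmarking the model outside of SageMaker, can be sketched as a simple local timing harness. The `predict` callable and payload below are hypothetical stand-ins for your real model:

```python
import time
import statistics

def benchmark(predict, payload, warmup=10, runs=100):
    """Time a model's predict function locally, outside a SageMaker
    endpoint, to separate model latency from endpoint overhead."""
    for _ in range(warmup):          # warm caches before measuring
        predict(payload)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        samples.append((time.perf_counter() - start) * 1000)  # milliseconds
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[int(0.99 * len(samples)) - 1],
    }

# Hypothetical stand-in for a real model's predict function:
result = benchmark(lambda x: sum(x), list(range(1000)))
print(result)
```

If the local p99 latency is already close to the ModelLatency that CloudWatch reports, the model itself, not the endpoint, is the bottleneck.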
Multiple factors can contribute to OverheadLatency, including the payload size of requests and responses, the request frequency, and the authentication or authorization of the request.
The first invocation of an endpoint might show increased latency because of a cold start; this is expected for the first invocation requests. To avoid this issue, send test requests to the endpoint to pre-warm it. Note that infrequent requests can also lead to increased OverheadLatency.
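Pre-warming can be a short loop of test invocations. A minimal sketch, assuming a boto3 `sagemaker-runtime` client and a JSON payload (the endpoint name and payload are placeholders):

```python
def prewarm(client, endpoint_name, payload, requests=5):
    """Send a few test invocations so the first real request doesn't
    pay the cold-start cost. `client` is expected to be a boto3
    sagemaker-runtime client, e.g. boto3.client("sagemaker-runtime")."""
    for _ in range(requests):
        client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=payload,
        )

# Usage (names are placeholders):
# import boto3
# prewarm(boto3.client("sagemaker-runtime"), "my-endpoint", b'{"inputs": [1, 2, 3]}')
```

For endpoints that receive infrequent traffic, running such a loop on a schedule keeps the endpoint warm between real requests.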