Random timeouts uploading objects in SageMaker async endpoints

Hi,

Our SageMaker endpoints (async mode) are randomly failing. Here is a snippet of the data-log log stream:

2024-03-05T17:40:10.207-05:00	2024-03-05T22:40:05.261:[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] Inference request succeeded. ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms

2024-03-05T17:40:22.016-05:00	2024-03-05T22:40:18.545:[sagemaker logs] [2648f5f1-7351-4664-8650-15cbf34afc49] Inference request succeeded. ModelLatency: 26629989 us, RequestDownloadLatency: 14018 us, ResponseUploadLatency: 173714 us, TimeInBacklog: 8 ms, TotalProcessingTime: 26867 ms

2024-03-05T17:40:26.207-05:00	2024-03-05T22:40:21.825:[sagemaker logs] [26b9a2af-296c-40a4-a1ea-bb9fd3895317] Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

The lines are slightly truncated, but the relevant information is there. As you can see, the three entries are close together in time, because they are the same request sent 3 times.
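Because the log mixes microseconds (us) and milliseconds (ms), it can help to normalize the fields before comparing requests. Below is a minimal sketch that pulls the latency fields out of a data-log line and converts them to milliseconds. The function name `parse_latencies_ms` and the regular expression are illustrative assumptions based only on the lines shown above; they are not part of any SageMaker tooling.

```python
import re

# Matches "Name: 123 us" or "Name: 123 ms" pairs as they appear in the lines above.
FIELD_RE = re.compile(r"(\w+): (\d+) (us|ms)")


def parse_latencies_ms(line: str) -> dict[str, float]:
    """Return every latency field found in the line, converted to milliseconds."""
    fields = {}
    for name, value, unit in FIELD_RE.findall(line):
        # Microsecond values are divided by 1000; millisecond values are kept as-is.
        fields[name] = int(value) / 1000 if unit == "us" else float(value)
    return fields


# First successful request from the snippet above (timestamps omitted).
ok_line = (
    "[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] "
    "Inference request succeeded. ModelLatency: 13326437 us, "
    "RequestDownloadLatency: 45463 us, ResponseUploadLatency: 206162 us, "
    "TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms"
)
fields = parse_latencies_ms(ok_line)

for name, ms in fields.items():
    print(f"{name}: {ms:.3f} ms")
```

Run against the two successful requests, this puts ResponseUploadLatency at roughly 206 ms and 174 ms. The failed request's line carries no latency fields at all, so the function returns an empty dict for it.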

If you open the error file in S3, this is its content:

Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

which doesn't help much, obviously.

Looking at the container logs for these generations, all 3 calls produce the same output (our custom logs), which ends with this entry in every case:

INFO: <ip>:<port> - "POST /invocations HTTP/1.1" 200 OK

We also use X-Ray, and all of these calls are traced. None of the traces shows an error, and the last entry (parse_output in our case) is present for all of them.

Here is a screenshot of the metrics for a 1-hour window around those calls:

[Screenshot: Metrics]

This also happens on more than one of our endpoints.

I am not sure what other information would help diagnose the issue, so feel free to ask!

Thank you in advance for the help.

John
Asked 2 months ago, 132 views
No answers
