Random timeouts uploading objects in SageMaker async endpoints


Hi,

Our SageMaker endpoints (async mode) are randomly failing. Here is a snippet from the data-log log stream:

2024-03-05T17:40:10.207-05:00	2024-03-05T22:40:05.261:[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] Inference request succeeded. ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms

2024-03-05T17:40:22.016-05:00	2024-03-05T22:40:18.545:[sagemaker logs] [2648f5f1-7351-4664-8650-15cbf34afc49] Inference request succeeded. ModelLatency: 26629989 us, RequestDownloadLatency: 14018 us, ResponseUploadLatency: 173714 us, TimeInBacklog: 8 ms, TotalProcessingTime: 26867 ms

2024-03-05T17:40:26.207-05:00	2024-03-05T22:40:21.825:[sagemaker logs] [26b9a2af-296c-40a4-a1ea-bb9fd3895317] Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

The lines are slightly truncated, but the relevant information is there. As you can see, the three entries are close together in time, because they are the same request sent 3 times.
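For anyone comparing many of these entries, here is a minimal sketch that parses the two line shapes shown above into records. It assumes the field names and units are exactly as they appear in this excerpt; the regular expressions, `parse_line`, and the output keys are hypothetical helpers of my own, not part of any SageMaker tooling.

```python
import re

# Field names and units are copied from the log excerpt above (assumption:
# every "succeeded" line follows this exact layout).
SUCCESS_RE = re.compile(
    r"\[(?P<request_id>[0-9a-f-]{36})\] Inference request succeeded\. "
    r"ModelLatency: (?P<model_us>\d+) us, "
    r"RequestDownloadLatency: (?P<download_us>\d+) us, "
    r"ResponseUploadLatency: (?P<upload_us>\d+) us, "
    r"TimeInBacklog: (?P<backlog_ms>\d+) ms, "
    r"TotalProcessingTime: (?P<total_ms>\d+) ms"
)
TIMEOUT_RE = re.compile(
    r"\[(?P<request_id>[0-9a-f-]{36})\] Timed out uploading object "
    r"\(bucket: (?P<bucket>[^,]+), key: (?P<key>[^)]+)\)"
)


def parse_line(line):
    """Return a dict for a recognized log line, or None otherwise."""
    m = SUCCESS_RE.search(line)
    if m:
        return {
            "status": "succeeded",
            "request_id": m["request_id"],
            # Convert everything to seconds so the fields are comparable.
            "model_s": int(m["model_us"]) / 1e6,
            "download_s": int(m["download_us"]) / 1e6,
            "upload_s": int(m["upload_us"]) / 1e6,
            "backlog_s": int(m["backlog_ms"]) / 1e3,
            "total_s": int(m["total_ms"]) / 1e3,
        }
    m = TIMEOUT_RE.search(line)
    if m:
        return {
            "status": "upload_timeout",
            "request_id": m["request_id"],
            "bucket": m["bucket"],
            "key": m["key"],
        }
    return None


# Two of the lines from the excerpt above, pasted verbatim.
SAMPLE_LINES = [
    "2024-03-05T17:40:10.207-05:00\t2024-03-05T22:40:05.261:[sagemaker logs] "
    "[e2b81384-0551-421b-acaf-70bd09e24287] Inference request succeeded. "
    "ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, "
    "ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, "
    "TotalProcessingTime: 13640 ms",
    "2024-03-05T17:40:26.207-05:00\t2024-03-05T22:40:21.825:[sagemaker logs] "
    "[26b9a2af-296c-40a4-a1ea-bb9fd3895317] Timed out uploading object "
    "(bucket: sagemaker-<name>-backplate-staging, "
    "key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).",
]

records = [parse_line(line) for line in SAMPLE_LINES]
for record in records:
    print(record)
```

In the successful entries the response upload takes roughly 0.2 s, and the timeout entry carries no latency fields at all, so the log by itself does not show how long the failed upload ran.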

If you open the error file in S3, this is its content:

Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

which doesn't help much, obviously.

Looking at the container logs for these generations, all 3 calls produce the same output (our custom logs), which ends with this entry in every case:

INFO: <ip>:<port> - "POST /invocations HTTP/1.1" 200 OK

We also use X-Ray, and all of these calls are traced. None of the traces shows an error, and the last entry (parse_output in our case) is present for all of them.

Here is a screenshot of the metrics for a 1-hour window around those calls:

[Screenshot: Metrics]

This also happens on more than one of our endpoints.

I am unsure what other information would help diagnose the issue, so feel free to ask!

Thank you in advance for the help.

John
Asked 2 months ago, 133 views
No answers