Random timeouts uploading objects in SageMaker async endpoints


Hi,

Our SageMaker endpoints (async mode) are randomly failing. Here is a snippet of the data-log log stream:

2024-03-05T17:40:10.207-05:00	2024-03-05T22:40:05.261:[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] Inference request succeeded. ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms

2024-03-05T17:40:22.016-05:00	2024-03-05T22:40:18.545:[sagemaker logs] [2648f5f1-7351-4664-8650-15cbf34afc49] Inference request succeeded. ModelLatency: 26629989 us, RequestDownloadLatency: 14018 us, ResponseUploadLatency: 173714 us, TimeInBacklog: 8 ms, TotalProcessingTime: 26867 ms

2024-03-05T17:40:26.207-05:00	2024-03-05T22:40:21.825:[sagemaker logs] [26b9a2af-296c-40a4-a1ea-bb9fd3895317] Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

The snippet is slightly truncated, but it contains the relevant information. As you can see, the entries are close in time because the same request was sent 3 times: the first two succeeded and the third timed out.
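For anyone comparing these entries, here is a minimal Python sketch that pulls the latency fields out of log lines shaped like the ones above. The regular expressions are assumptions based only on the three lines in this snippet, not on a documented SageMaker log format, and `parse_line` is a hypothetical helper name.

```python
import re

# Assumed pattern for the "Inference request succeeded" lines shown above.
SUCCESS_RE = re.compile(
    r"\[(?P<request_id>[0-9a-f-]{36})\] Inference request succeeded\. "
    r"ModelLatency: (?P<model_us>\d+) us, "
    r"RequestDownloadLatency: (?P<download_us>\d+) us, "
    r"ResponseUploadLatency: (?P<upload_us>\d+) us, "
    r"TimeInBacklog: (?P<backlog_ms>\d+) ms, "
    r"TotalProcessingTime: (?P<total_ms>\d+) ms"
)

# Assumed pattern for the "Timed out uploading object" lines shown above.
TIMEOUT_RE = re.compile(
    r"\[(?P<request_id>[0-9a-f-]{36})\] Timed out uploading object "
    r"\(bucket: (?P<bucket>[^,]+), key: (?P<key>[^)]+)\)"
)

NUMERIC_FIELDS = ("model_us", "download_us", "upload_us", "backlog_ms", "total_ms")


def parse_line(line):
    """Return a dict describing one log line, or None if it matches neither pattern."""
    match = SUCCESS_RE.search(line)
    if match:
        record = {"status": "succeeded", "request_id": match["request_id"]}
        # Keep the original units (microseconds / milliseconds) as integers.
        for field in NUMERIC_FIELDS:
            record[field] = int(match[field])
        return record

    match = TIMEOUT_RE.search(line)
    if match:
        return {
            "status": "upload_timeout",
            "request_id": match["request_id"],
            "bucket": match["bucket"],
            "key": match["key"],
        }
    return None
```

Running every line of the log stream through `parse_line` and grouping by `status` makes it easier to check whether the timeouts line up with unusually large `upload_us` values on neighbouring successful requests.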

If you open the error file in S3, its content is just the same message:

Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

which doesn't help much, obviously.

Looking at the container logs for these generations, all 3 calls produce the same output (our custom logs), and each ends with this entry:

INFO: <ip>:<port> - "POST /invocations HTTP/1.1" 200 OK

We also use X-Ray, and all of these calls are traced. None of the traces shows an error, and the last segment (parse_output in our case) is present for every call.

Here is a screenshot of the metrics for a 1-hour window around those calls:

[Screenshot: Metrics]

This happens on more than one of our endpoints.

I am unsure what other information would help diagnose the issue, so feel free to ask!

Thank you in advance for the help.

John
asked 2 months ago, 118 views
No Answers
