Random timeouts uploading objects in SageMaker async endpoints


Hi,

Our SageMaker endpoints (async mode) are randomly failing. Here is a snippet from the data-log log stream:

2024-03-05T17:40:10.207-05:00	2024-03-05T22:40:05.261:[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] Inference request succeeded. ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms

2024-03-05T17:40:22.016-05:00	2024-03-05T22:40:18.545:[sagemaker logs] [2648f5f1-7351-4664-8650-15cbf34afc49] Inference request succeeded. ModelLatency: 26629989 us, RequestDownloadLatency: 14018 us, ResponseUploadLatency: 173714 us, TimeInBacklog: 8 ms, TotalProcessingTime: 26867 ms

2024-03-05T17:40:26.207-05:00	2024-03-05T22:40:21.825:[sagemaker logs] [26b9a2af-296c-40a4-a1ea-bb9fd3895317] Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

The lines are slightly truncated, but the relevant information is there. As you can see, the three entries are close together in time because they are the same request sent 3 times.
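For anyone who wants to compare the successful and failed entries side by side, here is a minimal Python sketch that parses lines like the ones above. The regular expressions are assumptions inferred only from these three lines, not a documented SageMaker log format, and `parse_entry` is a hypothetical helper name. The leading CloudWatch timestamp is left out of the sample lines for brevity.

```python
import re

# Sample lines copied from the log snippet above (leading CloudWatch timestamp omitted).
LOG_LINES = [
    "2024-03-05T22:40:05.261:[sagemaker logs] [e2b81384-0551-421b-acaf-70bd09e24287] "
    "Inference request succeeded. ModelLatency: 13326437 us, RequestDownloadLatency: 45463 us, "
    "ResponseUploadLatency: 206162 us, TimeInBacklog: 7 ms, TotalProcessingTime: 13640 ms",
    "2024-03-05T22:40:18.545:[sagemaker logs] [2648f5f1-7351-4664-8650-15cbf34afc49] "
    "Inference request succeeded. ModelLatency: 26629989 us, RequestDownloadLatency: 14018 us, "
    "ResponseUploadLatency: 173714 us, TimeInBacklog: 8 ms, TotalProcessingTime: 26867 ms",
    "2024-03-05T22:40:21.825:[sagemaker logs] [26b9a2af-296c-40a4-a1ea-bb9fd3895317] "
    "Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, "
    "key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).",
]

# Assumed layout: "[sagemaker logs] [<inference id>] <message>".
ENTRY_RE = re.compile(r"\[sagemaker logs\] \[(?P<id>[0-9a-f-]+)\] (?P<message>.*)$")
# Assumed metric layout: "<Name>: <integer> us" or "<Name>: <integer> ms".
METRIC_RE = re.compile(r"(?P<name>\w+): (?P<value>\d+) (?P<unit>us|ms)")


def parse_entry(line):
    """Return the id, a coarse status, and all metrics in milliseconds, or None if no match."""
    match = ENTRY_RE.search(line)
    if match is None:
        return None
    message = match.group("message")

    metrics_ms = {}
    for metric in METRIC_RE.finditer(message):
        value = int(metric.group("value"))
        # Normalise microseconds to milliseconds so all fields are comparable.
        if metric.group("unit") == "us":
            metrics_ms[metric.group("name")] = value / 1000
        else:
            metrics_ms[metric.group("name")] = float(value)

    if message.startswith("Inference request succeeded"):
        status = "succeeded"
    elif message.startswith("Timed out uploading object"):
        status = "upload_timeout"
    else:
        status = "other"

    return {"id": match.group("id"), "status": status, "metrics_ms": metrics_ms}


if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_entry(line))
```

Note that the timeout entry carries no latency fields at all, so the parsed metrics for it are empty. The upload time for the failed call is not visible in this log stream.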

If you open the error file in S3, this is its entire content:

Timed out uploading object (bucket: sagemaker-<name>-backplate-staging, key: errors/875cd721-13ae-4478-adcd-e0fa4add5cfc-error.out).

which doesn't help much, obviously.

In the container logs for these generations, all 3 calls produce the same output (our custom logs), and each one ends with this entry:

INFO: <ip>:<port> - "POST /invocations HTTP/1.1" 200 OK

We also use X-Ray, and all of these calls are traced. None of the traces shows an error, and the last entry (parse_output in our case) is present in every one of them.

Here is a screenshot of the metrics for a 1-hour window around those calls:

[Screenshot: Metrics]

This happens on more than one of our endpoints.

I am unsure what other information would help diagnose the issue, so feel free to ask!

Thank you in advance for the help.
