Async inference Docker container restarts after less than 20 minutes, no helpful log found


I have an async inference endpoint on SageMaker with BYOC (bring your own container). A job may take about 20 minutes or more, and I have already set InvocationTimeoutSeconds to 3600 seconds.
The problem is: when I start a new inference request, I can see from CloudWatch that the job is in progress, and there are no /ping request logs during that time. But after about 10 minutes, /ping logs show up in CloudWatch again with an error saying the service is unavailable.
Then, about 6 minutes after that, a new log stream appears in CloudWatch and the older one stops.
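
For reference, this is roughly how I invoke the endpoint (a minimal sketch; the endpoint name and S3 locations are placeholders, not my real values):

    import boto3

    # Minimal sketch of the async invocation; endpoint name and S3 location
    # are placeholders, not the real values from this setup.
    smr = boto3.client("sagemaker-runtime")

    response = smr.invoke_endpoint_async(
        EndpointName="my-async-endpoint",               # placeholder
        InputLocation="s3://my-bucket/input/job.json",  # placeholder
        ContentType="application/json",
        InvocationTimeoutSeconds=3600,                  # the timeout mentioned above
    )
    print(response["OutputLocation"])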
Here is the log from CloudWatch:

...(/ping logs, until I send a request)

2023-05-17T16:12:15.761+08:00	task type:file (my job starts)
2023-05-17T16:22:58.223+08:00.     [error] 31#31: *389 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"
2023-05-17T16:23:02.761+08:00	169.254.178.2 - - [17/May/2023:08:22:58 +0000] "GET /ping HTTP/1.1" 502 166 "-" "AHC/2.0"

...(the error and /ping repeat for 6 minutes)

2023-05-17T16:28:58.133+08:00    [error] 31#31: *449 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"

How can I fix it?

Asked 1 year ago · Viewed 260 times
1 Answer
Accepted Answer

If I understand your log snippets correctly, your container is failing to respond to /ping while processing the long-running request. Failing to respond to pings for an extended period marks your endpoint as unhealthy, which signals SageMaker to restart the container.

A likely reason for not responding is that your request handling uses multi-processing in a way that maxes out all CPUs on the instance, leaving no cores/threads available to handle incoming pings while the data is being processed. In that case, the fix would be to identify which component(s) of your request handling use all available system cores at once and re-configure them to use int(os.environ["SM_NUM_CPUS"]) - 1 instead.
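
For example, a rough sketch of what that re-configuration could look like, assuming a multiprocessing-based handler (the function names here are illustrative, not your actual code):

    import os
    from multiprocessing import Pool

    # Leave one core free so the nginx/gunicorn front end can still answer /ping
    # while a long-running job is in progress. SM_NUM_CPUS is assumed to be set
    # inside the container; fall back to os.cpu_count() if it is not.
    NUM_WORKERS = max(1, int(os.environ.get("SM_NUM_CPUS", os.cpu_count())) - 1)

    def process_chunk(chunk):
        # Placeholder for the CPU-heavy part of the request handler.
        return sum(chunk)

    def handle_request(chunks):
        with Pool(processes=NUM_WORKERS) as pool:
            return pool.map(process_chunk, chunks)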

A similar but less likely reason is that you're using a fully custom serving stack, or have explicitly re-configured the default one to have only one worker thread. In that case your main request handling might block the server, with no threads available to pick up concurrent pings, even though CPU resources are free.

AWS
Expert
Alex_T
Answered 1 year ago
  • You are right, and the problem has been solved. I use a custom Docker image, modified from an AWS SageMaker example for real-time inference (not async inference). It uses gunicorn with one worker and one thread; the reason is written in the file named "serve":

    # for our GPU based inference, set to one.  This process is GPU bound, and the GPU may run out of space if more than one model is loaded.
    model_server_workers = 1
    

    After digging all day, I realized this was likely the root cause. I changed the configuration to make gunicorn run with 2 threads, and finally it works.
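
    For anyone hitting the same issue, this is roughly the shape of the change in the "serve" script (a sketch based on the standard BYOC example layout; the exact flags in your script may differ, and note that --threads only applies to gunicorn's gthread worker class):

    import os
    import subprocess

    model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 3600)
    # Still one worker, so only one copy of the model sits on the GPU...
    model_server_workers = 1

    def start_server():
        # ...but two threads, so one thread can answer /ping while the other is
        # busy with the long-running inference request.
        gunicorn = subprocess.Popen(['gunicorn',
                                     '--timeout', str(model_server_timeout),
                                     '--worker-class', 'gthread',
                                     '--threads', '2',
                                     '-b', 'unix:/tmp/gunicorn.sock',
                                     '-w', str(model_server_workers),
                                     'wsgi:app'])
        gunicorn.wait()

    if __name__ == '__main__':
        start_server()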
