Sagemaker Batch Transform Job Failure: Timeout Issue and Job Restarted Unexpectedly

0

I am using the batch transform function in SageMaker for the inference of my PyTorch model. I am using the same structure as https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container. The error is that my job will start multiple times on different workers if I choose multiple workers. Or it will repeat after finish if I choose 1 worker.

I think it should be some errors in timeout setup. I have tried to increase the keepalive_timeout and proxy_read_timeout in the serve file and tried the SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable. But nothing worked. Could some one help? Thanks!

  • To understand the scenario better, can you share the error message and the code used for the setup?

没有答案

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则